CHAPTER 3


Integration 


SECTION 15. THE INTEGRAL 


Expected values of simple random variables and Riemann integrals of continuous
functions can be brought together with other related concepts under a general
theory of integration, and this theory is the subject of the present chapter.


Definition 


Throughout this section, $f$, $g$, and so on will denote real measurable
functions, the values $\pm\infty$ allowed, on a measure space
$(\Omega, \mathcal{F}, \mu)$.† The object is to define and study the definite integral

    $\int f\,d\mu = \int f(\omega)\,d\mu(\omega) = \int f(\omega)\,\mu(d\omega).$


Suppose first that $f$ is nonnegative. For each finite decomposition $\{A_i\}$ of
$\Omega$ into $\mathcal{F}$-sets, consider the sum

(15.1)    $\sum_i \Bigl[\inf_{\omega \in A_i} f(\omega)\Bigr]\,\mu(A_i).$


In computing the products here, the conventions about infinity are

(15.2)    $0 \cdot \infty = \infty \cdot 0 = 0,$
          $x \cdot \infty = \infty \cdot x = \infty$ if $0 < x < \infty,$
          $\infty \cdot \infty = \infty.$


†Although the definitions (15.3) and (15.6) apply even if $f$ is not measurable $\mathcal{F}$, the proofs of
most theorems about integration do use the assumption of measurability in one way or another.
For the role of measurability, and for alternative definitions of the integral, see the problems.




The reasons for these conventions will become clear later. Also in force are
the conventions of Section 10 for sums and limits involving infinity; see (10.3)
and (10.4). If $A_i$ is empty, the infimum in (15.1) is by the standard convention
$\infty$; but then $\mu(A_i) = 0$, so that by the convention (15.2), this term makes no
contribution to the sum (15.1).

The integral of f is defined as the supremum of the sums (15.1): 


(15.3)    $\int f\,d\mu = \sup \sum_i \Bigl[\inf_{\omega \in A_i} f(\omega)\Bigr]\,\mu(A_i).$


The supremum here extends over all finite decompositions $\{A_i\}$ of $\Omega$ into
$\mathcal{F}$-sets.
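On a finite space the supremum in (15.3) can be computed by brute force, which makes the role of the conventions (15.2) concrete. The Python sketch below is only an illustration under stated assumptions: $\Omega$ is a small finite set, every subset is treated as an $\mathcal{F}$-set, $\mu$ is given by point masses, and the helper names (`ext_mul`, `set_partitions`, `lower_sum`, `integral`) are hypothetical rather than notation from the text.

```python
from fractions import Fraction
from math import inf


def ext_mul(x, y):
    """Product on [0, inf] with the conventions (15.2): 0 * inf = inf * 0 = 0."""
    if x == 0 or y == 0:
        return 0
    return x * y


def set_partitions(items):
    """Yield every partition of the list `items` into nonempty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into one of the existing blocks ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + partition


def lower_sum(f, mu, partition):
    """The sum (15.1) for one finite decomposition (assumption: point masses)."""
    total = 0
    for block in partition:
        infimum = min(f[w] for w in block)
        mass = sum(mu[w] for w in block)
        total += ext_mul(infimum, mass)
    return total


def integral(f, mu):
    """Supremum (15.3) over all decompositions of a finite Omega."""
    return max(lower_sum(f, mu, p) for p in set_partitions(list(f)))


# Illustrative data (hypothetical): the point "c" has infinite mass but f(c) = 0.
f = {"a": 3, "b": 1, "c": 0}
mu = {"a": Fraction(1, 2), "b": 2, "c": inf}
value = integral(f, mu)
```

For this sample data the decomposition into singletons attains the supremum, $3 \cdot \tfrac12 + 1 \cdot 2 + 0 \cdot \infty = 7/2$. The helper `ext_mul` is needed because the floating-point product of zero and infinity is not a number, whereas (15.2) assigns it the value 0.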
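For general $f$, consider its positive part,

(15.4)    $f^+(\omega) = \begin{cases} f(\omega) & \text{if } 0 \le f(\omega) \le \infty, \\ 0 & \text{if } -\infty \le f(\omega) \le 0, \end{cases}$

and its negative part,

(15.5)    $f^-(\omega) = \begin{cases} -f(\omega) & \text{if } -\infty \le f(\omega) \le 0, \\ 0 & \text{if } 0 \le f(\omega) \le \infty. \end{cases}$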


These functions are nonnegative and measurable, and $f = f^+ - f^-$. The general
integral is defined by

(15.6)    $\int f\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu,$

unless $\int f^+\,d\mu = \int f^-\,d\mu = \infty$, in which case $f$ has no integral.

If $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are both finite, then $f$ is integrable, or integrable $\mu$,
or summable, and has (15.6) as its definite integral. If $\int f^+\,d\mu = \infty$ and
$\int f^-\,d\mu < \infty$, then $f$ is not integrable but in accordance with (15.6) is assigned
$\infty$ as its definite integral. Similarly, if $\int f^+\,d\mu < \infty$ and $\int f^-\,d\mu = \infty$, then $f$ is
not integrable but has definite integral $-\infty$. Note that $f$ can have a definite
integral without being integrable; it fails to have a definite integral if and only
if its positive and negative parts both have infinite integrals.

The really important case of (15.6) is that in which $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are
both finite. Allowing infinite integrals is a convention that simplifies the
statements of various theorems, especially theorems involving nonnegative
functions. Note that (15.6) is defined unless it involves "$\infty - \infty$"; if one term
on the right is $\infty$ and the other is a finite real $x$, the difference is defined by
the conventions $\infty - x = \infty$ and $x - \infty = -\infty$.

The extension of the integral from the nonnegative case to the general
case is consistent: (15.6) agrees with (15.3) if $f$ is nonnegative, because then
$f^- = 0$.
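The case analysis behind (15.6) is short enough to write out explicitly. The sketch below is a hedged illustration, not part of the text: the function names are hypothetical, and the two arguments of `definite_integral` are assumed to be the already computed values of $\int f^+\,d\mu$ and $\int f^-\,d\mu$ in $[0, \infty]$.

```python
from math import inf


def positive_part(x):
    """f+ evaluated at one extended-real value, as in (15.4)."""
    return x if x > 0 else 0


def negative_part(x):
    """f- evaluated at one extended-real value, as in (15.5)."""
    return -x if x < 0 else 0


def is_integrable(int_pos, int_neg):
    """Integrable means both parts have finite integrals."""
    return int_pos < inf and int_neg < inf


def definite_integral(int_pos, int_neg):
    """Combine the two integrals as in (15.6).

    Returns None for the excluded case "inf - inf" (assumption: None is
    this sketch's way of saying that f has no integral).
    """
    if int_pos == inf and int_neg == inf:
        return None
    return int_pos - int_neg
```

A function can thus have a definite integral (`definite_integral(inf, 2)` is infinite) without being integrable, and only the pair of two infinite values yields no integral at all.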




Nonnegative Functions 


It is convenient first to analyze nonnegative functions. 


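Theorem 15.1. (i) If $f = \sum_i x_i I_{A_i}$ is a nonnegative simple function, $\{A_i\}$
being a finite decomposition of $\Omega$ into $\mathcal{F}$-sets, then $\int f\,d\mu = \sum_i x_i \mu(A_i)$.

(ii) If $0 \le f(\omega) \le g(\omega)$ for all $\omega$, then $\int f\,d\mu \le \int g\,d\mu$.

(iii) If $0 \le f_n(\omega) \uparrow f(\omega)$ for all $\omega$, then $0 \le \int f_n\,d\mu \uparrow \int f\,d\mu$.

(iv) For nonnegative functions $f$ and $g$ and nonnegative constants $\alpha$ and $\beta$,
$\int (\alpha f + \beta g)\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu$.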


In part (iii) the essential point is that $\int f\,d\mu = \lim_n \int f_n\,d\mu$, and it is
important to understand that both sides of this equation may be $\infty$. If $f_n = I_{A_n}$
and $f = I_A$, where $A_n \uparrow A$, the conclusion is that $\mu$ is continuous from below
(Theorem 10.2(i)): $\lim_n \mu(A_n) = \mu(A)$; this equation often takes the form

    $\infty = \infty.$


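PROOF OF (i). Let $\{B_j\}$ be a finite decomposition of $\Omega$ and let $\beta_j$ be the
infimum of $f$ over $B_j$. If $A_i \cap B_j \ne \emptyset$, then $\beta_j \le x_i$; therefore,
$\sum_j \beta_j \mu(B_j) = \sum_{ij} \beta_j \mu(A_i \cap B_j) \le \sum_{ij} x_i \mu(A_i \cap B_j) = \sum_i x_i \mu(A_i)$.
On the other hand, there is equality here if $\{B_j\}$ coincides with $\{A_i\}$. ■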


PROOF OF (ii). The sums (15.1) obviously do not decrease if $f$ is replaced
by $g$. ■


PROOF OF (iii). By (ii) the sequence $\int f_n\,d\mu$ is nondecreasing and bounded
above by $\int f\,d\mu$. It therefore suffices to show that $\int f\,d\mu \le \lim_n \int f_n\,d\mu$, or that

(15.7)    $\lim_n \int f_n\,d\mu \ge S = \sum_{i=1}^m v_i \mu(A_i)$

if $A_1, \ldots, A_m$ is any decomposition of $\Omega$ into $\mathcal{F}$-sets and $v_i = \inf_{\omega \in A_i} f(\omega)$.

In order to see the essential idea of the proof, which is quite simple,
suppose first that $S$ is finite and all the $v_i$ and $\mu(A_i)$ are positive and finite.
Fix an $\epsilon$ that is positive and less than each $v_i$, and put
$A_{in} = [\omega \in A_i\colon f_n(\omega) > v_i - \epsilon]$. Since $f_n \uparrow f$, $A_{in} \uparrow A_i$.
Decompose $\Omega$ into $A_{1n}, \ldots, A_{mn}$ and the complement of their union, and observe
that, since $\mu$ is continuous from below,

(15.8)    $\int f_n\,d\mu \ge \sum_{i=1}^m (v_i - \epsilon)\mu(A_{in}) \to \sum_{i=1}^m (v_i - \epsilon)\mu(A_i) = S - \epsilon \sum_{i=1}^m \mu(A_i).$

Since the $\mu(A_i)$ are all finite, letting $\epsilon \to 0$ gives (15.7).




Now suppose only that $S$ is finite. Each product $v_i \mu(A_i)$ is then finite;
suppose it is positive for $i \le m_0$ and 0 for $i > m_0$. (Here $m_0 \le m$; if the
product is 0 for all $i$, then $S = 0$ and (15.7) is trivial.) Now $v_i$ and $\mu(A_i)$ are
positive and finite for $i \le m_0$ (one or the other may be $\infty$ for $i > m_0$). Define
$A_{in}$ as before, but only for $i \le m_0$. This time decompose $\Omega$ into $A_{1n}, \ldots, A_{m_0 n}$
and the complement of their union. Replace $m$ by $m_0$ in (15.8) and complete
the proof as before.

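Finally, suppose that $S = \infty$. Then $v_{i_0}\mu(A_{i_0}) = \infty$ for some $i_0$, so that $v_{i_0}$
and $\mu(A_{i_0})$ are both positive and at least one is $\infty$. Suppose $0 < x < v_{i_0} \le \infty$
and $0 < y < \mu(A_{i_0}) \le \infty$, and put $A_{i_0 n} = [\omega \in A_{i_0}\colon f_n(\omega) > x]$. From $f_n \uparrow f$
follows $A_{i_0 n} \uparrow A_{i_0}$; hence $\mu(A_{i_0 n}) > y$ for $n$ exceeding some $n_0$. But then
(decompose $\Omega$ into $A_{i_0 n}$ and its complement) $\int f_n\,d\mu \ge x\mu(A_{i_0 n}) \ge xy$ for
$n \ge n_0$, and therefore $\lim_n \int f_n\,d\mu \ge xy$. If $v_{i_0} = \infty$, let $x \to \infty$, and if
$\mu(A_{i_0}) = \infty$, let $y \to \infty$. In either case (15.7) follows: $\lim_n \int f_n\,d\mu = \infty$. ■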


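PROOF OF (iv). Suppose at first that $f = \sum_i x_i I_{A_i}$ and $g = \sum_j y_j I_{B_j}$ are
simple. Then $\alpha f + \beta g = \sum_{ij} (\alpha x_i + \beta y_j) I_{A_i \cap B_j}$, and so

    $\int (\alpha f + \beta g)\,d\mu = \sum_{ij} (\alpha x_i + \beta y_j)\mu(A_i \cap B_j)$
    $= \alpha \sum_i x_i \mu(A_i) + \beta \sum_j y_j \mu(B_j) = \alpha \int f\,d\mu + \beta \int g\,d\mu.$

Note that the argument is valid if some of $\alpha$, $\beta$, $x_i$, $y_j$ are infinite. Apart from
this possibility, the ideas are as in the proof of (5.21).

For general nonnegative $f$ and $g$, there exist by Theorem 13.5 simple
functions $f_n$ and $g_n$ such that $0 \le f_n \uparrow f$ and $0 \le g_n \uparrow g$. But then
$0 \le \alpha f_n + \beta g_n \uparrow \alpha f + \beta g$ and
$\int (\alpha f_n + \beta g_n)\,d\mu = \alpha \int f_n\,d\mu + \beta \int g_n\,d\mu$, so that (iv) follows
from (iii). ■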


By part (i) of Theorem 15.1, the expected values of simple random
variables in Chapter 1 are integrals: $E[X] = \int X(\omega)\,P(d\omega)$. This also covers
the step functions in Section 1 (see (1.6)). The relation between the Riemann
integral and the integral as defined here will be studied in Section 17.


Example 15.1. Consider the line $(R^1, \mathcal{R}^1, \lambda)$ with Lebesgue measure.
Suppose that $-\infty < a_0 \le a_1 \le \cdots \le a_m < \infty$, and let $f$ be the function with
nonnegative value $x_i$ on $(a_{i-1}, a_i]$, $i = 1, \ldots, m$, and value 0 on $(-\infty, a_0]$ and
$(a_m, \infty)$. By part (i) of Theorem 15.1, $\int f\,d\lambda = \sum_{i=1}^m x_i(a_i - a_{i-1})$ because of the
convention $0 \cdot \infty = 0$; see (15.2). If the "area under the curve" to the left of
$a_0$ and to the right of $a_m$ is to be 0, this convention is inevitable. From
$\infty \cdot 0 = 0$ it follows that $\int f\,d\lambda = 0$ if $f$ is $\infty$ at a single point (say) and 0
elsewhere.

If $f = I_{(a_0, \infty)}$, the area-under-the-curve point of view makes $\int f\,d\lambda = \infty$
natural. Hence the second convention in (15.2), which also requires that the
integral be infinite if $f$ is $\infty$ on a nonempty interval and 0 elsewhere. ■
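The step-function formula of Example 15.1 is easy to evaluate directly. The following Python sketch assumes the breakpoints $a_0 \le \cdots \le a_m$ and the values $x_1, \ldots, x_m$ are supplied as lists; the function name `step_integral` and the sample numbers are hypothetical.

```python
from fractions import Fraction


def step_integral(breakpoints, values):
    """Sum of x_i * (a_i - a_{i-1}) for the step function of Example 15.1.

    `breakpoints` holds a_0 <= a_1 <= ... <= a_m and `values` holds
    x_1, ..., x_m; the function is 0 outside (a_0, a_m], so the two
    unbounded intervals contribute 0 * inf = 0 and are simply omitted.
    """
    if len(breakpoints) != len(values) + 1:
        raise ValueError("need exactly one more breakpoint than values")
    return sum(
        x * (right - left)
        for x, left, right in zip(values, breakpoints, breakpoints[1:])
    )


# Illustrative data: value 2 on (0, 1], 1/2 on (1, 3], 5 on (3, 4].
area = step_integral([0, 1, 3, 4], [2, Fraction(1, 2), 5])
```

Here `area` is $2 \cdot 1 + \tfrac12 \cdot 2 + 5 \cdot 1 = 8$.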


Recall that almost everywhere means outside a set of measure 0.

Theorem 15.2. Suppose that $f$ and $g$ are nonnegative.

(i) If $f = 0$ almost everywhere, then $\int f\,d\mu = 0$.

(ii) If $\mu[\omega\colon f(\omega) > 0] > 0$, then $\int f\,d\mu > 0$.

(iii) If $\int f\,d\mu < \infty$, then $f < \infty$ almost everywhere.

(iv) If $f \le g$ almost everywhere, then $\int f\,d\mu \le \int g\,d\mu$.

(v) If $f = g$ almost everywhere, then $\int f\,d\mu = \int g\,d\mu$.


PROOF. Suppose that $f = 0$ almost everywhere. If $A_i$ meets $[\omega\colon f(\omega) = 0]$,
then the infimum in (15.1) is 0; otherwise, $\mu(A_i) = 0$. Hence each sum (15.1)
is 0, and (i) follows.

If $A_\epsilon = [\omega\colon f(\omega) \ge \epsilon]$, then $A_\epsilon \uparrow [\omega\colon f(\omega) > 0]$ as $\epsilon \downarrow 0$, so that under the
hypothesis of (ii) there is a positive $\epsilon$ for which $\mu(A_\epsilon) > 0$. Decomposing $\Omega$
into $A_\epsilon$ and its complement shows that $\int f\,d\mu \ge \epsilon\mu(A_\epsilon) > 0$.

If $\mu[f = \infty] > 0$, decompose $\Omega$ into $[f = \infty]$ and its complement:
$\int f\,d\mu \ge \infty \cdot \mu[f = \infty] = \infty$ by the conventions. Hence (iii).

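To prove (iv), let $G = [f \le g]$. For any finite decomposition $\{A_1, \ldots, A_m\}$
of $\Omega$,

    $\sum_i \Bigl[\inf_{A_i} f\Bigr]\mu(A_i) = \sum_i \Bigl[\inf_{A_i} f\Bigr]\mu(A_i \cap G) \le \sum_i \Bigl[\inf_{A_i \cap G} f\Bigr]\mu(A_i \cap G)$
    $\le \sum_i \Bigl[\inf_{A_i \cap G} g\Bigr]\mu(A_i \cap G) \le \int g\,d\mu,$

where the last inequality comes from a consideration of the decomposition
$A_1 \cap G, \ldots, A_m \cap G, G^c$. This proves (iv), and (v) follows immediately. ■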


Suppose that $f = g$ almost everywhere, where $f$ and $g$ need not be
nonnegative. If $f$ has a definite integral, then since $f^+ = g^+$ and $f^- = g^-$
almost everywhere, it follows by Theorem 15.2(v) that $g$ also has a definite
integral and $\int f\,d\mu = \int g\,d\mu$.


Uniqueness 


Although there are various ways to frame the definition of the integral, they are all
equivalent: they all assign the same value to $\int f\,d\mu$. This is because the integral is
uniquely determined by certain simple properties it is natural to require of it.

It is natural to want the integral to have properties (i) and (iii) of Theorem 15.1.
But these uniquely determine the integral for nonnegative functions: For $f$ nonnegative,
there exist by Theorem 13.5 simple functions $f_n$ such that $0 \le f_n \uparrow f$; by (iii),
$\int f\,d\mu$ must be $\lim_n \int f_n\,d\mu$, and (i) determines the value of each $\int f_n\,d\mu$.




Property (i) can itself be derived from (iv) (linearity) together with the assumption
that $\int I_A\,d\mu = \mu(A)$ for indicators $I_A$:
$\int \bigl(\sum_i x_i I_{A_i}\bigr)\,d\mu = \sum_i x_i \int I_{A_i}\,d\mu = \sum_i x_i \mu(A_i)$.

If (iv) of Theorem 15.1 is to persist when the integral is extended beyond the class
of nonnegative functions, $\int f\,d\mu$ must be $\int (f^+ - f^-)\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu$, which
makes the definition (15.6) inevitable.


PROBLEMS 


These problems outline alternative definitions of the integral and clarify the role
measurability plays. Call (15.3) the lower integral, and write it as

(15.9)    $\int_* f\,d\mu = \sup \sum_i \Bigl[\inf_{\omega \in A_i} f(\omega)\Bigr]\mu(A_i)$

to distinguish it from the upper integral

(15.10)    $\int^* f\,d\mu = \inf \sum_i \Bigl[\sup_{\omega \in A_i} f(\omega)\Bigr]\mu(A_i).$

The infimum in (15.10), like the supremum in (15.9), extends over all finite partitions
$\{A_i\}$ of $\Omega$ into $\mathcal{F}$-sets.


15.1. Suppose that $f$ is measurable and nonnegative. Show that $\int^* f\,d\mu = \infty$ if
$\mu[\omega\colon f(\omega) > 0] = \infty$ or if $\mu[\omega\colon f(\omega) > a] > 0$ for all $a$.


There are many functions familiar from calculus that ought to be integrable but
are of the types in the preceding problem and hence have infinite upper integral.
Examples are $x^{-2} I_{(1,\infty)}(x)$ and $x^{-1/2} I_{(0,1)}(x)$. Therefore, (15.10) is inappropriate as a
definition of $\int f\,d\mu$ for nonnegative $f$. The only problem with (15.10), however, is that
it treats infinity the wrong way. To see this, and to focus on essentials, assume that
$\mu(\Omega) < \infty$ and that $f$ is bounded, although not necessarily nonnegative or
measurable $\mathcal{F}$.


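15.2. ↑ (a) Show that

    $\sum_i \Bigl[\inf_{\omega \in A_i} f(\omega)\Bigr]\mu(A_i) \le \sum_j \Bigl[\inf_{\omega \in B_j} f(\omega)\Bigr]\mu(B_j)$

if $\{B_j\}$ refines $\{A_i\}$. Prove a dual relation for the sums in (15.10) and conclude
that

(15.11)    $\int_* f\,d\mu \le \int^* f\,d\mu.$

(b) Now assume that $f$ is measurable $\mathcal{F}$ and let $M$ be a bound for $|f|$.
Consider the partition $A_i = [\omega\colon i\epsilon \le f(\omega) < (i+1)\epsilon]$, where $i$ ranges from $-N$
to $N$ and $N$ is large enough that $N\epsilon > M$. Show that

    $\sum_i \Bigl[\sup_{\omega \in A_i} f(\omega)\Bigr]\mu(A_i) - \sum_i \Bigl[\inf_{\omega \in A_i} f(\omega)\Bigr]\mu(A_i) \le \epsilon\mu(\Omega).$

Conclude that

(15.12)    $\int_* f\,d\mu = \int^* f\,d\mu.$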


To define the integral as the common value in (15.12) is the Darboux-Young
approach. The advantage of (15.3) as a definition is that (in the nonnegative
case) it applies at once to unbounded $f$ and infinite $\mu$.


15.3. 3.2 15.2↑ For $A \subset \Omega$, define $\mu^*(A)$ and $\mu_*(A)$ by (3.9) and (3.10) with $\mu$ in
place of $P$. Show that $\int^* I_A\,d\mu = \mu^*(A)$ and $\int_* I_A\,d\mu = \mu_*(A)$ for every $A$.
Therefore, (15.12) can fail if $f$ is not measurable $\mathcal{F}$. (Where was measurability
used in the proof of (15.12)?)


The definitions (15.3) and (15.6) always make formal sense (for finite $\mu(\Omega)$ and
$\sup|f|$), but they are reasonable, that is, accord with intuition, only if (15.12) holds. Under
what conditions does it hold?


15.4. 10.5 15.3↑ (a) Suppose of $f$ that there exist an $\mathcal{F}$-set $A$ and a function $g$,
measurable $\mathcal{F}$, such that $\mu(A) = 0$ and $[f \ne g] \subset A$. This is the same thing as
assuming that $\mu^*[f \ne g] = 0$, or assuming that $f$ is measurable with respect to $\mathcal{F}$
completed with respect to $\mu$. Show that (15.12) holds.

(b) Show that if (15.12) holds, then so does the italicized condition in part (a).


Rather than assume that $f$ is measurable $\mathcal{F}$, one can assume that it satisfies the
italicized condition in Problem 15.4(a), which in case $(\Omega, \mathcal{F}, \mu)$ is complete is the
same thing anyway. For the next three problems, assume that $\mu(\Omega) < \infty$ and that $f$ is
measurable $\mathcal{F}$ and bounded.


15.5. ↑ Show that for positive $\epsilon$ there exists a finite partition $\{A_i\}$ such that, if $\{B_j\}$
is any finer partition and $\omega_j \in B_j$, then

    $\Bigl|\int f\,d\mu - \sum_j f(\omega_j)\mu(B_j)\Bigr| < \epsilon.$


15.6. ↑ Show that

    $\int f\,d\mu = \lim_n \sum_{|k| \le n2^n} \frac{k}{2^n}\,\mu\Bigl[\omega\colon \frac{k}{2^n} \le f(\omega) < \frac{k+1}{2^n}\Bigr].$

The limit on the right here is Lebesgue's definition of the integral.
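The sums in Problem 15.6 slice the range of $f$ rather than its domain, and on a finite space they can be evaluated exactly. The Python sketch below is an illustration only, assuming a finite $\Omega$ with point masses and a bounded $f$ taking rational values; the name `lebesgue_sum` and the sample data are hypothetical.

```python
from fractions import Fraction
from math import floor


def lebesgue_sum(f, mu, n):
    """The n-th sum of Problem 15.6 on a finite space with point masses.

    Each point w lies in exactly one slice [k/2^n, (k+1)/2^n) of the range,
    namely the one with k = floor(f(w) * 2^n); it contributes (k/2^n) * mu(w)
    provided |k| <= n * 2^n.
    """
    scale = 2 ** n
    total = Fraction(0)
    for w, value in f.items():
        k = floor(value * scale)
        if abs(k) <= n * scale:
            total += Fraction(k, scale) * mu[w]
    return total


# Illustrative data (hypothetical): three points with finite masses.
f = {"a": Fraction(1, 3), "b": Fraction(-5, 4), "c": Fraction(2)}
mu = {"a": 1, "b": 2, "c": Fraction(1, 2)}
exact = sum(f[w] * mu[w] for w in f)  # integral of a simple function
approximations = [lebesgue_sum(f, mu, n) for n in range(2, 12)]
```

Once $n$ is large enough that every value of $f$ falls in the range of slices, each sum underestimates `exact` by at most $2^{-n}\mu(\Omega)$, which is the mechanism behind the limit in the problem.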




15.7. ↑ Suppose that the integral is defined for simple nonnegative functions by
$\int \bigl(\sum_i x_i I_{A_i}\bigr)\,d\mu = \sum_i x_i \mu(A_i)$. Suppose that $f_n$ and $g_n$ are simple and
nondecreasing and have a common limit: $0 \le f_n \uparrow f$ and $0 \le g_n \uparrow f$. Adapt the arguments
used to prove Theorem 15.1(iii) and show that $\lim_n \int f_n\,d\mu = \lim_n \int g_n\,d\mu$. Thus,
in the nonnegative case, $\int f\,d\mu$ can (Theorem 13.5) consistently be defined as
$\lim_n \int f_n\,d\mu$ for simple functions for which $0 \le f_n \uparrow f$.


SECTION 16. PROPERTIES OF THE INTEGRAL 


Equalities and Inequalities 


By definition, the requirement for integrability of $f$ is that $\int f^+\,d\mu$ and
$\int f^-\,d\mu$ both be finite, which is the same as the requirement that
$\int f^+\,d\mu + \int f^-\,d\mu < \infty$ and hence is the same as the requirement that
$\int (f^+ + f^-)\,d\mu < \infty$ (Theorem 15.1(iv)). Since $f^+ + f^- = |f|$, $f$ is integrable if and only if

(16.1)    $\int |f|\,d\mu < \infty.$

It follows that if $|f| \le |g|$ almost everywhere and $g$ is integrable, then $f$ is
integrable as well. If $\mu(\Omega) < \infty$, a bounded $f$ is integrable.


Theorem 16.1. (i) Monotonicity: If $f$ and $g$ are integrable and $f \le g$ almost
everywhere, then

(16.2)    $\int f\,d\mu \le \int g\,d\mu.$

(ii) Linearity: If $f$ and $g$ are integrable and $\alpha$, $\beta$ are finite real numbers, then
$\alpha f + \beta g$ is integrable and

(16.3)    $\int (\alpha f + \beta g)\,d\mu = \alpha \int f\,d\mu + \beta \int g\,d\mu.$


PROOF OF (i). For nonnegative $f$ and $g$ such that $f \le g$ almost everywhere,
(16.2) follows by Theorem 15.2(iv). And for general integrable $f$ and
$g$, if $f \le g$ almost everywhere, then $f^+ \le g^+$ and $f^- \ge g^-$ almost everywhere,
and so (16.2) follows by the definition (15.6). ■


PROOF OF (ii). First, $\alpha f + \beta g$ is integrable because, by Theorem 15.1,

    $\int |\alpha f + \beta g|\,d\mu \le \int (|\alpha| \cdot |f| + |\beta| \cdot |g|)\,d\mu = |\alpha| \int |f|\,d\mu + |\beta| \int |g|\,d\mu < \infty.$

By Theorem 15.1(iv) and the definition (15.6), $\int (\alpha f)\,d\mu = \alpha \int f\,d\mu$; consider
separately the cases $\alpha \ge 0$ and $\alpha \le 0$. Therefore, it is enough to check (16.3)
for the case $\alpha = \beta = 1$. By definition, $(f+g)^+ - (f+g)^- = f + g = f^+ - f^- + g^+ - g^-$,
and therefore $(f+g)^+ + f^- + g^- = (f+g)^- + f^+ + g^+$. All these functions
being nonnegative,
$\int (f+g)^+\,d\mu + \int f^-\,d\mu + \int g^-\,d\mu = \int (f+g)^-\,d\mu + \int f^+\,d\mu + \int g^+\,d\mu$,
which can be rearranged to give
$\int (f+g)^+\,d\mu - \int (f+g)^-\,d\mu = \int f^+\,d\mu - \int f^-\,d\mu + \int g^+\,d\mu - \int g^-\,d\mu$.
But this reduces to (16.3). ■


Since $-|f| \le f \le |f|$, it follows by Theorem 16.1 that

(16.4)    $\Bigl|\int f\,d\mu\Bigr| \le \int |f|\,d\mu$

for integrable $f$. Applying this to integrable $f$ and $g$ gives

(16.5)    $\Bigl|\int f\,d\mu - \int g\,d\mu\Bigr| \le \int |f - g|\,d\mu.$


Example 16.1. Suppose that $\Omega$ is countable, that $\mathcal{F}$ consists of all the
subsets of $\Omega$, and that $\mu$ is counting measure: each singleton has measure 1.
To be definite, take $\Omega = \{1, 2, \ldots\}$. A function is then a sequence $x_1, x_2, \ldots$.
If $x_{nm}$ is $x_m$ or 0 as $m \le n$ or $m > n$, the function corresponding to
$x_{n1}, x_{n2}, \ldots$ has integral $\sum_{m=1}^n x_m$ by Theorem 15.1(i) (consider the decomposition
$\{1\}, \ldots, \{n\}, \{n+1, n+2, \ldots\}$). It follows by Theorem 15.1(iii) that in
the nonnegative case the integral of the function given by $\{x_m\}$ is the sum
$\sum_m x_m$ (finite or infinite) of the corresponding infinite series. In the general
case the function is integrable if and only if $\sum_{m=1}^\infty |x_m|$ is a convergent infinite
series, in which case the integral is $\sum_{m=1}^\infty x_m^+ - \sum_{m=1}^\infty x_m^-$.

The function $x_m = (-1)^{m+1} m^{-1}$ is not integrable by this definition and
even fails to have a definite integral, since $\sum_{m=1}^\infty x_m^+ = \sum_{m=1}^\infty x_m^- = \infty$. This
invites comparison with the ordinary theory of infinite series, according to
which the alternating harmonic series does converge in the sense that
$\lim_M \sum_{m=1}^M (-1)^{m+1} m^{-1} = \log 2$. But since this says that the sum of the first
$M$ terms has a limit, it requires that the elements of the space $\Omega$ be ordered.
If $\Omega$ consists not of the positive integers but, say, of the integer lattice points
in 3-space, it has no canonical linear ordering. And if $\sum_m x_m$ is to have the
same finite value no matter what the order of summation, the series must be
absolutely convergent.† This helps to explain why $f$ is defined to be integrable
only if $\int f^+\,d\mu$ and $\int f^-\,d\mu$ are both finite. ■
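The dependence on ordering in Example 16.1 can be seen numerically. The Python sketch below is a floating-point illustration, not a proof: it sums the alternating harmonic series in its natural order, then in a rearranged order (two positive terms followed by one negative term, a classical rearrangement whose limit is $\tfrac32 \log 2$), and also sums the positive parts alone. All function names are hypothetical.

```python
from math import log


def natural_order_sum(terms):
    """Partial sum of (-1)^(m+1) / m for m = 1, ..., terms."""
    return sum((-1) ** (m + 1) / m for m in range(1, terms + 1))


def rearranged_sum(blocks):
    """Same terms, reordered: two positive terms, then one negative term."""
    total = 0.0
    odd, even = 1, 2
    for _ in range(blocks):
        total += 1 / odd + 1 / (odd + 2)  # next two positive terms
        odd += 4
        total -= 1 / even  # next negative term
        even += 2
    return total


def positive_part_sum(terms):
    """Partial sum of the positive parts x_m^+ (the odd-indexed terms)."""
    return sum(1 / m for m in range(1, terms + 1, 2))


natural = natural_order_sum(100_000)   # close to log 2
reordered = rearranged_sum(100_000)    # close to 1.5 * log 2
```

The two orderings settle near different values, while `positive_part_sum` keeps growing (roughly like half the logarithm of the number of terms), which is the numerical shadow of $\sum_m x_m^+ = \infty$.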


Example 16.2. In connection with Example 15.1, consider the function
$f = 3I_{(a,\infty)} - 2I_{(-\infty,a)}$. There is no natural value for $\int f\,d\lambda$ (it is "$\infty - \infty$"), and
none is assigned by the definition.


†RUDIN1, p. 76.



If a function $f$ is bounded on bounded intervals, then each function
$f_n = fI_{(-n,n)}$ is integrable with respect to $\lambda$. Since $f = \lim_n f_n$, the limit of
$\int f_n\,d\lambda$, if it exists, is sometimes called the "principal value" of the integral of
$f$. Although it is natural for some purposes to integrate symmetrically about
the origin, this is not the right definition of the integral in the context of
general measure theory. The functions $g_n = fI_{(-n,n+1)}$ for example also
converge to $f$, and $\int g_n\,d\lambda$ may have some other limit, or none at all; $f(x) = x$
is a case in point. There is no general reason why $f_n$ should take precedence
over $g_n$.

As in the preceding example, $f = \sum_{k=1}^\infty (-1)^k k^{-1} I_{(k,k+1)}$ has no integral,
even though the $\int f_n\,d\lambda$ above converge. ■


Integration to the Limit 


The first result, the monotone convergence theorem, essentially restates Theo- 
rem 15.1(iii). 


Theorem 16.2. If $0 \le f_n \uparrow f$ almost everywhere, then $\int f_n\,d\mu \uparrow \int f\,d\mu$.

PROOF. If $0 \le f_n \uparrow f$ on a set $A$ with $\mu(A^c) = 0$, then $0 \le f_n I_A \uparrow fI_A$ holds
everywhere, and it follows by Theorem 15.1(iii) and the remark following
Theorem 15.2 that $\int f_n\,d\mu = \int f_n I_A\,d\mu \uparrow \int fI_A\,d\mu = \int f\,d\mu$. ■


As the functions in Theorem 16.2 are nonnegative almost everywhere, all
the integrals exist. The conclusion of the theorem is that $\lim_n \int f_n\,d\mu$ and
$\int f\,d\mu$ are both infinite or both finite and in the latter case are equal.


Example 16.3. Consider the space $\{1, 2, \ldots\}$ together with counting measure,
as in Example 16.1. If for each $m$ one has $0 \le x_{nm} \uparrow x_m$ as $n \to \infty$, then
$\lim_n \sum_m x_{nm} = \sum_m x_m$, a standard result about infinite series. ■


Example 16.4. If $\mu$ is a measure on $\mathcal{F}$, and $\mathcal{F}_0$ is a $\sigma$-field contained in
$\mathcal{F}$, then the restriction $\mu_0$ of $\mu$ to $\mathcal{F}_0$ is another measure (Example 10.4). If
$f = I_A$ and $A \in \mathcal{F}_0$, then

    $\int f\,d\mu = \int f\,d\mu_0,$

the common value being $\mu(A) = \mu_0(A)$. The same is true by linearity for
nonnegative simple functions measurable $\mathcal{F}_0$. It holds by Theorem 16.2 for
all nonnegative $f$ that are measurable $\mathcal{F}_0$, because (Theorem 13.5) $0 \le f_n \uparrow f$
for simple functions $f_n$ that are measurable $\mathcal{F}_0$. For functions measurable
$\mathcal{F}_0$, integration with respect to $\mu$ is thus the same thing as integration with
respect to $\mu_0$. ■




In this example a property was extended by linearity from indicators to
nonnegative simple functions and thence to the general nonnegative function
by a monotone passage to the limit. This is a technique of very frequent
application.


Example 16.5. For a finite or infinite sequence of measures $\mu_n$ on $\mathcal{F}$,
$\mu(A) = \sum_n \mu_n(A)$ defines another measure (countably additive because [A27]
sums can be reversed in a nonnegative double series). For indicators $f$,

    $\int f\,d\mu = \sum_n \int f\,d\mu_n,$

and by linearity the same holds for simple $f \ge 0$. If $0 \le f_k \uparrow f$ for simple $f_k$,
then by Theorem 16.2 and Example 16.3,
$\int f\,d\mu = \lim_k \int f_k\,d\mu = \lim_k \sum_n \int f_k\,d\mu_n = \sum_n \lim_k \int f_k\,d\mu_n = \sum_n \int f\,d\mu_n$.
The relation in question thus holds for all nonnegative $f$. ■


An important consequence of the monotone convergence theorem is 
Fatou’s lemma: 


Theorem 16.3. For nonnegative $f_n$,

(16.6)    $\int \liminf_n f_n\,d\mu \le \liminf_n \int f_n\,d\mu.$

PROOF. If $g_n = \inf_{k \ge n} f_k$, then $0 \le g_n \uparrow g = \liminf_n f_n$, and the preceding
two theorems give $\int f_n\,d\mu \ge \int g_n\,d\mu \to \int g\,d\mu$. ■


Example 16.6. On $(R^1, \mathcal{R}^1, \lambda)$, the functions $f_n = n^2 I_{(0, n^{-1})}$ and $f = 0$
satisfy $f_n(x) \to f(x)$ for each $x$, but $\int f\,d\lambda = 0$ and $\int f_n\,d\lambda = n \to \infty$. This shows
that the inequality in (16.6) can be strict and that it is not always possible to
integrate to the limit. This phenomenon has been encountered before; see
Examples 5.7 and 7.7. ■
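The two halves of Example 16.6, pointwise convergence to 0 and integrals growing without bound, can be checked with exact arithmetic. This Python sketch is illustrative only; `f_n` and `integral_f_n` are hypothetical names, and the integral is computed from the closed form (height times the length of the interval) rather than by numerical quadrature.

```python
from fractions import Fraction


def f_n(n, x):
    """f_n = n^2 on the open interval (0, 1/n) and 0 elsewhere."""
    return n * n if 0 < x < Fraction(1, n) else 0


def integral_f_n(n):
    """Height n^2 times the Lebesgue measure 1/n of (0, 1/n)."""
    return n * n * Fraction(1, n)


# At a fixed point x the values are eventually 0 ...
x = Fraction(1, 10)
values_at_x = [f_n(n, x) for n in range(1, 21)]
# ... while the integrals equal n and therefore diverge.
integrals = [integral_f_n(n) for n in range(1, 21)]
```

At $x = 1/10$ the values are $n^2$ for $n < 10$ and 0 from $n = 10$ on, so the left side of (16.6) is 0 while the right side is infinite.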


Fatou’s lemma leads to Lebesgue’s dominated convergence theorem: 


Theorem 16.4. If $|f_n| \le g$ almost everywhere, where $g$ is integrable, and if
$f_n \to f$ almost everywhere, then $f$ and the $f_n$ are integrable and $\int f_n\,d\mu \to \int f\,d\mu$.


PROOF. Assume at the outset, not that the $f_n$ converge, but only that
they are dominated by an integrable $g$, which implies that all the $f_n$ as well
as $f^* = \limsup_n f_n$ and $f_* = \liminf_n f_n$ are integrable. Since $g + f_n$ and
$g - f_n$ are nonnegative, Fatou's lemma gives

    $\int g\,d\mu + \int f_*\,d\mu = \int \liminf_n (g + f_n)\,d\mu$
    $\le \liminf_n \int (g + f_n)\,d\mu = \int g\,d\mu + \liminf_n \int f_n\,d\mu,$

and

    $\int g\,d\mu - \int f^*\,d\mu = \int \liminf_n (g - f_n)\,d\mu$
    $\le \liminf_n \int (g - f_n)\,d\mu = \int g\,d\mu - \limsup_n \int f_n\,d\mu.$

Therefore

(16.7)    $\int \liminf_n f_n\,d\mu \le \liminf_n \int f_n\,d\mu \le \limsup_n \int f_n\,d\mu \le \int \limsup_n f_n\,d\mu.$

(Compare this with (4.9).)

Now use the assumption that $f_n \to f$ almost everywhere: $f$ is dominated by
$g$ and hence is integrable, and the extreme terms in (16.7) agree with $\int f\,d\mu$. ■


Example 16.6 shows that this theorem can fail if no dominating g exists. 


Example 16.7. The Weierstrass M-test for series. Consider the space
$\{1, 2, \ldots\}$ together with counting measure, as in Example 16.1. If $|x_{nm}| \le M_m$
and $\sum_m M_m < \infty$, and if $\lim_n x_{nm} = x_m$ for each $m$, then $\lim_n \sum_m x_{nm} = \sum_m x_m$.
This follows by an application of Theorem 16.4 with the function given by the
sequence $M_1, M_2, \ldots$ in the role of $g$. This is another standard result on
infinite series [A28]. ■


The next result, the bounded convergence theorem, is a special case of
Theorem 16.4. It contains Theorem 5.4 as a further special case.

Theorem 16.5. If $\mu(\Omega) < \infty$ and the $f_n$ are uniformly bounded, then $f_n \to f$
almost everywhere implies $\int f_n\,d\mu \to \int f\,d\mu$.

The next two theorems are simply the series versions of the monotone and
dominated convergence theorems.




Theorem 16.6. If $f_n \ge 0$, then $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.

The members of this last equation are both equal either to $\infty$ or to the
same finite, nonnegative real number.

Theorem 16.7. If $\sum_n f_n$ converges almost everywhere and $\bigl|\sum_{k=1}^n f_k\bigr| \le g$
almost everywhere, where $g$ is integrable, then $\sum_n f_n$ and the $f_n$ are integrable
and $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.

Corollary. If $\sum_n \int |f_n|\,d\mu < \infty$, then $\sum_n f_n$ converges absolutely almost
everywhere and is integrable, and $\int \sum_n f_n\,d\mu = \sum_n \int f_n\,d\mu$.

PROOF. The function $g = \sum_n |f_n|$ is integrable by Theorem 16.6 and is
finite almost everywhere by Theorem 15.2(iii). Hence $\sum_n |f_n|$ and $\sum_n f_n$ converge
almost everywhere, and Theorem 16.7 applies. ■


In place of a sequence $\{f_n\}$ of real measurable functions on $(\Omega, \mathcal{F}, \mu)$, consider a
family $[f_t\colon t > 0]$ indexed by a continuous parameter $t$. Suppose of a measurable $f$
that

(16.8)    $\lim_{t \to \infty} f_t(\omega) = f(\omega)$

on a set $A$, where

(16.9)    $A \in \mathcal{F}, \quad \mu(\Omega - A) = 0.$


A technical point arises here, since $\mathcal{F}$ need not contain the $\omega$-set where (16.8) holds:

Example 16.8. Let $\mathcal{F}$ consist of the Borel subsets of $\Omega = [0, 1)$, and let $H$ be a
nonmeasurable set, that is, a subset of $\Omega$ that does not lie in $\mathcal{F}$ (see the end of Section 3).
Define $f_t(\omega) = 1$ if $\omega$ equals the fractional part $t - \lfloor t \rfloor$ of $t$ and their common value
lies in $H^c$; define $f_t(\omega) = 0$ otherwise. Each $f_t$ is measurable $\mathcal{F}$, but if $f(\omega) \equiv 0$, then
the $\omega$-set where (16.8) holds is exactly $H$. ■


Because of such examples, the set $A$ above must be assumed to lie in $\mathcal{F}$. (Because
of Theorem 13.4, no such assumption is necessary in the case of sequences.)

Suppose that $f$ and the $f_t$ are integrable. If $I_t = \int f_t\,d\mu$ converges to $I = \int f\,d\mu$ as
$t \to \infty$, then certainly $I_{t_n} \to I$ for each sequence $\{t_n\}$ going to infinity. But the converse
holds as well: If $I_t$ does not converge to $I$, then there is a positive $\epsilon$ such that
$|I_{t_n} - I| > \epsilon$ for a sequence $\{t_n\}$ going to infinity. To the question of whether $I_{t_n}$
converges to $I$ the previous theorems apply.

Suppose that (16.8) and $|f_t(\omega)| \le g(\omega)$ both hold for $\omega \in A$, where $A$ satisfies
(16.9) and $g$ is integrable. By the dominated convergence theorem, $f$ and the $f_t$ must
then be integrable and $I_{t_n} \to I$ for each sequence $\{t_n\}$ going to infinity. It follows that
$\int f_t\,d\mu \to \int f\,d\mu$. In this result $t$ could go continuously to 0 or to some other value
instead of to infinity.




Theorem 16.8. Suppose that $f(\omega, t)$ is a measurable and integrable function of $\omega$
for each $t$ in $(a, b)$. Let $\phi(t) = \int f(\omega, t)\,\mu(d\omega)$.

(i) Suppose that for $\omega \in A$, where $A$ satisfies (16.9), $f(\omega, t)$ is continuous in $t$ at $t_0$;
suppose further that $|f(\omega, t)| \le g(\omega)$ for $\omega \in A$ and $|t - t_0| < \delta$, where $\delta$ is independent
of $\omega$ and $g$ is integrable. Then $\phi(t)$ is continuous at $t_0$.

(ii) Suppose that for $\omega \in A$, where $A$ satisfies (16.9), $f(\omega, t)$ has in $(a, b)$ a derivative
$f'(\omega, t)$; suppose further that $|f'(\omega, t)| \le g(\omega)$ for $\omega \in A$ and $t \in (a, b)$, where $g$ is
integrable. Then $\phi(t)$ has derivative $\int f'(\omega, t)\,\mu(d\omega)$ on $(a, b)$.


PROOF. Part (i) is an immediate consequence of the preceding discussion. To
prove part (ii), consider a fixed $t$. If $\omega \in A$, then by the mean-value theorem,

    $\frac{f(\omega, t+h) - f(\omega, t)}{h} = f'(\omega, s),$

where $s$ lies between $t$ and $t + h$. The ratio on the left goes† to $f'(\omega, t)$ as $h \to 0$ and
is by hypothesis dominated by the integrable function $g(\omega)$. Therefore,

    $\frac{\phi(t+h) - \phi(t)}{h} = \int \frac{f(\omega, t+h) - f(\omega, t)}{h}\,\mu(d\omega) \to \int f'(\omega, t)\,\mu(d\omega).$ ■

The condition involving $g$ in part (ii) can be weakened. It suffices to assume that
for each $t$ there is an integrable $g(\omega, t)$ such that $|f'(\omega, s)| \le g(\omega, t)$ for $\omega \in A$ and all
$s$ in some neighborhood of $t$.


Integration over Sets 


The integral of $f$ over a set $A$ in $\mathcal{F}$ is defined by

(16.10)    $\int_A f\,d\mu = \int I_A f\,d\mu.$

The definition applies if $f$ is defined only on $A$ in the first place (set $f = 0$
outside $A$). Notice that $\int_A f\,d\mu = 0$ if $\mu(A) = 0$.

All the concepts and theorems above carry over in an obvious way to
integrals over $A$. Theorems 16.6 and 16.7 yield this result:

Theorem 16.9. If $A_1, A_2, \ldots$ are disjoint, and if $f$ is either nonnegative or
integrable, then $\int_{\bigcup_n A_n} f\,d\mu = \sum_n \int_{A_n} f\,d\mu$.

†Letting $h$ go to 0 through a sequence shows that each $f'(\cdot, t)$ is measurable $\mathcal{F}$ on $A$; take it to
be 0, say, elsewhere.


The integrals (16.10) usually suffice to determine f: 


Theorem 16.10. (i) If $f$ and $g$ are nonnegative and $\int_A f\,d\mu = \int_A g\,d\mu$ for all
$A$ in $\mathcal{F}$, and if $\mu$ is $\sigma$-finite, then $f = g$ almost everywhere.

(ii) If $f$ and $g$ are integrable and $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathcal{F}$, then $f = g$
almost everywhere.

(iii) If $f$ and $g$ are integrable and $\int_A f\,d\mu = \int_A g\,d\mu$ for all $A$ in $\mathcal{P}$, where $\mathcal{P}$
is a $\pi$-system generating $\mathcal{F}$ and $\Omega$ is a finite or countable union of $\mathcal{P}$-sets, then
$f = g$ almost everywhere.


PROOF. Suppose that $f$ and $g$ are nonnegative and that $\int_A f\,d\mu \le \int_A g\,d\mu$
for all $A$ in $\mathcal{F}$. If $\mu$ is $\sigma$-finite, there are $\mathcal{F}$-sets $A_n$ such that $A_n \uparrow \Omega$
and $\mu(A_n) < \infty$. If $B_n = [0 \le g < f,\ g \le n]$, then the hypothesized inequality
applied to $A_n \cap B_n$ implies $\int_{A_n \cap B_n} f\,d\mu \le \int_{A_n \cap B_n} g\,d\mu < \infty$ (finite because
$A_n \cap B_n$ has finite measure and $g$ is bounded there) and hence
$\int I_{A_n \cap B_n}(f - g)\,d\mu = 0$. But then by Theorem 15.2(ii), the integrand is 0
almost everywhere, and so $\mu(A_n \cap B_n) = 0$. Therefore, $\mu[0 \le g < f,\ g < \infty] = 0$,
so that $f \le g$ almost everywhere; (i) follows.

The argument for (ii) is simpler: If $f$ and $g$ are integrable and $\int_A f\,d\mu \le
\int_A g\,d\mu$ for all $A$ in $\mathcal{F}$, then $\int I_{[g<f]}(f - g)\,d\mu = 0$ and hence $\mu[g < f] = 0$ by
Theorem 15.2(ii).

Part (iii) for nonnegative $f$ and $g$ follows from part (ii) together with
Theorem 10.4. For the general case, prove that $f^+ + g^- = f^- + g^+$ almost
everywhere. ■


Densities 


Suppose that $\delta$ is a nonnegative measurable function and define a measure $\nu$
by (Theorem 16.9)

(16.11)    $\nu(A) = \int_A \delta\,d\mu, \quad A \in \mathcal{F};$

$\delta$ is not assumed integrable with respect to $\mu$. Many measures arise in this
way. Note that $\mu(A) = 0$ implies that $\nu(A) = 0$. Clearly, $\nu$ is finite if and only
if $\delta$ is integrable $\mu$. Another function $\delta'$ gives rise to the same $\nu$ if $\delta = \delta'$
almost everywhere. On the other hand, $\nu(A) = \int_A \delta'\,d\mu$ and (16.11) together
imply that $\delta = \delta'$ almost everywhere if $\mu$ is $\sigma$-finite, as follows from Theorem
16.10(i).

The measure $\nu$ defined by (16.11) is said to have density $\delta$ with respect
to $\mu$. A density is by definition nonnegative.

Formal substitution $d\nu = \delta\,d\mu$ gives the formulas (16.12) and (16.13).




Theorem 16.11. If $\nu$ has density $\delta$ with respect to $\mu$, then

(16.12)    $\int f\,d\nu = \int f\delta\,d\mu$

holds for nonnegative $f$. Moreover, $f$ (not necessarily nonnegative) is integrable
with respect to $\nu$ if and only if $f\delta$ is integrable with respect to $\mu$, in which case
(16.12) and

(16.13)    $\int_A f\,d\nu = \int_A f\delta\,d\mu$

both hold. For nonnegative $f$, (16.13) always holds.

Here $f\delta$ is to be taken as 0 if $f = 0$ or if $\delta = 0$; this is consistent with the
conventions (15.2). Note that $\nu[\delta = 0] = 0$.


PROOF. If $f = I_A$, then $\int f\,d\nu = \nu(A)$, so that (16.12) reduces to the
definition (16.11). If $f$ is a simple nonnegative function, (16.12) then follows
by linearity. If $f$ is nonnegative, then $\int f_n\,d\nu = \int f_n\delta\,d\mu$ for the simple functions
$f_n$ of Theorem 13.5, and (16.12) follows by a monotone passage to the
limit, that is, by Theorem 16.2. Note that both sides of (16.12) may be
infinite.

Even if $f$ is not nonnegative, (16.12) applies to $|f|$, whence it follows that
$f$ is integrable with respect to $\nu$ if and only if $f\delta$ is integrable with respect to
$\mu$. And if $f$ is integrable, (16.12) follows from differencing the same result for
$f^+$ and $f^-$. Replacing $f$ by $fI_A$ leads from (16.12) to (16.13). ■


Example 16.9. If $\nu(A) = \mu(A \cap A_0)$, then (16.11) holds with $\delta = I_{A_0}$, and
(16.13) reduces to $\int_A f\,d\nu = \int_{A \cap A_0} f\,d\mu$. ■

Theorem 16.11 has two features in common with a number of theorems
about integration:

(i) The relation in question, (16.12) in this case, in addition to holding for
integrable functions, holds for all nonnegative functions, the point being
that if one side of the equation is infinite, then so is the other, and if both are
finite, then they have the same value. This is useful in checking for integrability
in the first place.

(ii) The result is proved first for indicator functions, then for simple
functions, then for nonnegative functions, then for integrable functions. In
this connection, see Examples 16.4 and 16.5.


The next result is Scheffé’s theorem. 




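Theorem 16.12. Suppose that $\nu_n(A) = \int_A \delta_n\,d\mu$ and $\nu(A) = \int_A \delta\,d\mu$ for
densities $\delta_n$ and $\delta$. If

(16.14)    $\nu_n(\Omega) = \nu(\Omega) < \infty, \quad n = 1, 2, \ldots,$

and if $\delta_n \to \delta$ except on a set of $\mu$-measure 0, then

(16.15)    $\sup_{A \in \mathcal{F}} |\nu(A) - \nu_n(A)| \le \int_\Omega |\delta - \delta_n|\,d\mu \to 0.$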
AEF Q 


PROOF. The inequality in (16.15) of course follows from (16.5). Let
$g_n = \delta - \delta_n$. The positive part $g_n^+$ of $g_n$ converges to 0 except on a set of
$\mu$-measure 0. Moreover, $0 \le g_n^+ \le \delta$ and $\delta$ is integrable, and so the dominated
convergence theorem applies: $\int g_n^+\,d\mu \to 0$. But $\int g_n\,d\mu = 0$ by (16.14),
and therefore

    $\int_\Omega |g_n|\,d\mu = \int_{[g_n \ge 0]} g_n\,d\mu - \int_{[g_n < 0]} g_n\,d\mu = 2\int_{[g_n \ge 0]} g_n\,d\mu = 2\int g_n^+\,d\mu \to 0.$ ■

A corollary concerning infinite series follows immediately: take $\mu$ as
counting measure on $\Omega = \{1, 2, \ldots\}$.


Corollary. If $\sum_m x_{nm} = \sum_m x_m < \infty$, the terms being nonnegative, and if
$\lim_n x_{nm} = x_m$ for each $m$, then $\lim_n \sum_m |x_{nm} - x_m| = 0$. If $y_m$ is bounded, then
$\lim_n \sum_m y_m x_{nm} = \sum_m y_m x_m$.
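The inequality in (16.15) can be verified exhaustively when $\mu$ is counting measure on a small finite set. The Python sketch below is a hedged illustration: the densities, the mixing construction in `mixed_density`, and all function names are hypothetical choices made so that $\nu_n(\Omega) = \nu(\Omega) = 1$ and $\delta_n \to \delta$ pointwise.

```python
from fractions import Fraction
from itertools import combinations


def sup_over_sets(delta, delta_n):
    """sup over all subsets A of |nu(A) - nu_n(A)| for counting measure."""
    points = range(len(delta))
    best = Fraction(0)
    for size in range(len(delta) + 1):
        for subset in combinations(points, size):
            gap = abs(sum(delta[i] - delta_n[i] for i in subset))
            best = max(best, gap)
    return best


def l1_distance(delta, delta_n):
    """The integral of |delta - delta_n| with respect to counting measure."""
    return sum(abs(a - b) for a, b in zip(delta, delta_n))


def mixed_density(delta, n):
    """Hypothetical delta_n: weight 1/n on the uniform density, rest on delta."""
    uniform = Fraction(1, len(delta))
    weight = Fraction(1, n)
    return [(1 - weight) * d + weight * uniform for d in delta]


delta = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
checks = [
    (sup_over_sets(delta, mixed_density(delta, n)),
     l1_distance(delta, mixed_density(delta, n)))
    for n in range(1, 8)
]
```

In this example the $L^1$ distance is $1/(3n)$ and the supremum over sets is exactly half of it, $1/(6n)$, so both sides of (16.15) tend to 0 as the theorem predicts.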


Change of Variable 


Let $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ be measurable spaces, and suppose that the
mapping $T\colon \Omega \to \Omega'$ is measurable $\mathcal{F}/\mathcal{F}'$. For a measure $\mu$ on $\mathcal{F}$, define a
measure $\mu T^{-1}$ on $\mathcal{F}'$ by

(16.16)    $\mu T^{-1}(A') = \mu(T^{-1}A'), \quad A' \in \mathcal{F}',$

as at the end of Section 13.

Suppose $f$ is a real function on $\Omega'$ that is measurable $\mathcal{F}'$, so that the
composition $fT$ is a real function on $\Omega$ that is measurable $\mathcal{F}$ (Theorem
13.1(ii)). The change-of-variable formulas are (16.17) and (16.18). If $A' = \Omega'$,
the second reduces to the first.




Theorem 16.13. If $f$ is nonnegative, then

(16.17)  $\int_\Omega f(T\omega)\,\mu(d\omega) = \int_{\Omega'} f(\omega')\,\mu T^{-1}(d\omega').$

A function $f$ (not necessarily nonnegative) is integrable with respect to $\mu T^{-1}$ if and only if $fT$ is integrable with respect to $\mu$, in which case (16.17) and

(16.18)  $\int_{T^{-1}A'} f(T\omega)\,\mu(d\omega) = \int_{A'} f(\omega')\,\mu T^{-1}(d\omega')$

hold. For nonnegative $f$, (16.18) always holds.


PROOF. If $f = I_{A'}$, then $fT = I_{T^{-1}A'}$, and so (16.17) reduces to the definition (16.16). By linearity, (16.17) holds for nonnegative simple functions. If $f_n$ are simple functions for which $0 \le f_n \uparrow f$, then $0 \le f_nT \uparrow fT$, and (16.17) follows by the monotone convergence theorem.

An application of (16.17) to $|f|$ establishes the assertion about integrability, and for integrable $f$, (16.17) follows by decomposition into positive and negative parts. Finally, if $f$ is replaced by $fI_{A'}$, (16.17) reduces to (16.18). ■


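Example 16.10. Suppose that $(\Omega', \mathcal{F}') = (R^1, \mathcal{R}^1)$ and $T = \phi$ is an ordinary real function, measurable $\mathcal{F}$. If $f(x) = x$, (16.17) becomes

(16.19)  $\int_\Omega \phi(\omega)\,\mu(d\omega) = \int_{R^1} x\,\mu\phi^{-1}(dx).$

If $\phi = \sum_i x_i I_{A_i}$ is simple, then $\mu\phi^{-1}$ has mass $\mu(A_i)$ at $x_i$, and each side of (16.19) reduces to $\sum_i x_i\mu(A_i)$. ■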
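
For a simple $\phi$ on a finite space, both sides of (16.19) are finite sums and can be compared directly. The following sketch is an illustration only; the four-point space, the masses, and the values of $\phi$ are hypothetical choices made for the example.

```python
from collections import defaultdict

# Hypothetical finite measure space: points of Omega with masses mu({w}).
mu = {"a": 0.5, "b": 1.5, "c": 2.0, "d": 1.0}
# A simple function phi on Omega; the points "b" and "c" share a value.
phi = {"a": -1.0, "b": 3.0, "c": 3.0, "d": 0.5}


def pushforward(mu, phi):
    """Image measure mu phi^{-1}: the mass at x is mu[w: phi(w) = x]."""
    image = defaultdict(float)
    for w, mass in mu.items():
        image[phi[w]] += mass
    return dict(image)


# Left side of (16.19): integrate phi over Omega with respect to mu.
lhs = sum(phi[w] * mass for w, mass in mu.items())

# Right side of (16.19): integrate x with respect to the image measure.
mu_phi = pushforward(mu, phi)
rhs = sum(x * mass for x, mass in mu_phi.items())

print(lhs, rhs, mu_phi)
```

The image measure collects the masses of all points sharing a value of $\phi$, which is exactly the statement that $\mu\phi^{-1}$ has mass $\mu(A_i)$ at $x_i$.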


Uniform Integrability 


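If $f$ is integrable, then $|f| I_{[|f| \ge \alpha]}$ goes to 0 almost everywhere as $\alpha \to \infty$ and is dominated by $|f|$, and hence

(16.20)  $\lim_{\alpha \to \infty} \int_{[|f| \ge \alpha]} |f|\,d\mu = 0.$

A sequence $\{f_n\}$ is uniformly integrable if (16.20) holds uniformly in $n$:

(16.21)  $\lim_{\alpha \to \infty} \sup_n \int_{[|f_n| \ge \alpha]} |f_n|\,d\mu = 0.$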
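
The role of the supremum in (16.21) can be seen on two families that are not taken from the text and are assumed here only for illustration: on the unit interval with Lebesgue measure, $f_n = nI_{(0,1/n)}$ and $g_n = \sqrt{n}\,I_{(0,1/n)}$. The tail integrals are available in closed form, and the sketch takes the supremum over a finite range of $n$, which for these two families is where it is attained.

```python
import math


def tail_f(n, alpha):
    """Integral of |f_n| over [|f_n| >= alpha] for f_n = n on (0, 1/n)."""
    # The set [|f_n| >= alpha] is (0, 1/n) when n >= alpha, else empty.
    return 1.0 if n >= alpha else 0.0


def tail_g(n, alpha):
    """Same tail integral for g_n = sqrt(n) on (0, 1/n)."""
    return n ** -0.5 if math.sqrt(n) >= alpha else 0.0


def sup_tail(tail, alpha, n_max=10**5):
    """Supremum of the tail integrals over n = 1, ..., n_max."""
    return max(tail(n, alpha) for n in range(1, n_max + 1))


for alpha in (2.0, 10.0, 100.0):
    print(alpha, sup_tail(tail_f, alpha), sup_tail(tail_g, alpha))
```

For $\{f_n\}$ the supremum stays at 1 however large $\alpha$ is, so (16.21) fails although each $f_n$ separately satisfies (16.20); for $\{g_n\}$ the supremum is at most $1/\alpha$.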




If (16.21) holds and $\mu(\Omega) < \infty$, and if $\alpha$ is large enough that the supremum in (16.21) is less than 1, then

(16.22)  $\int |f_n|\,d\mu \le \alpha\mu(\Omega) + 1,$

and hence the $f_n$ are integrable. On the other hand, (16.21) always holds if the $f_n$ are uniformly bounded, but the $f_n$ need not in that case be integrable if $\mu(\Omega) = \infty$. For this reason the concept of uniform integrability is interesting only for $\mu$ finite.

If $h$ is the maximum of $|f|$ and $|g|$, then

$\int_{[|f+g| \ge 2\alpha]} |f + g|\,d\mu \le 2\int_{[h \ge \alpha]} h\,d\mu \le 2\int_{[|f| \ge \alpha]} |f|\,d\mu + 2\int_{[|g| \ge \alpha]} |g|\,d\mu.$

Therefore, if $\{f_n\}$ and $\{g_n\}$ are uniformly integrable, so is $\{f_n + g_n\}$.

Theorem 16.14. Suppose that $\mu(\Omega) < \infty$ and $f_n \to f$ almost everywhere.

(i) If the $f_n$ are uniformly integrable, then $f$ is integrable and

(16.23)  $\int f_n\,d\mu \to \int f\,d\mu.$

(ii) If $f$ and the $f_n$ are nonnegative and integrable, then (16.23) implies that the $f_n$ are uniformly integrable.


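PROOF. If the $f_n$ are uniformly integrable, it follows by (16.22) and Fatou's lemma that $f$ is integrable. Define

$f_n^{(\alpha)} = f_n$ if $|f_n| < \alpha$, and $f_n^{(\alpha)} = 0$ if $|f_n| \ge \alpha$;  $f^{(\alpha)} = f$ if $|f| < \alpha$, and $f^{(\alpha)} = 0$ if $|f| \ge \alpha$.

If $\mu[|f| = \alpha] = 0$, then $f_n^{(\alpha)} \to f^{(\alpha)}$ almost everywhere, and by the bounded convergence theorem,

(16.24)  $\int f_n^{(\alpha)}\,d\mu \to \int f^{(\alpha)}\,d\mu.$

Since

(16.25)  $\int f_n\,d\mu - \int f_n^{(\alpha)}\,d\mu = \int_{[|f_n| \ge \alpha]} f_n\,d\mu$

and

(16.26)  $\int f\,d\mu - \int f^{(\alpha)}\,d\mu = \int_{[|f| \ge \alpha]} f\,d\mu,$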




it follows from (16.24) that

$\limsup_n \left| \int f_n\,d\mu - \int f\,d\mu \right| \le \sup_n \int_{[|f_n| \ge \alpha]} |f_n|\,d\mu + \int_{[|f| \ge \alpha]} |f|\,d\mu.$

And now (16.23) follows from the uniform integrability and the fact that $\mu[|f| = \alpha] = 0$ for all but countably many $\alpha$.

Suppose on the other hand that (16.23) holds, where $f$ and the $f_n$ are nonnegative and integrable. If $\mu[f = \alpha] = 0$, then (16.24) holds, and from (16.25) and (16.26) follows

(16.27)  $\lim_n \int_{[f_n \ge \alpha]} f_n\,d\mu = \int_{[f \ge \alpha]} f\,d\mu.$

Since $f$ is integrable, there is, for given $\epsilon$, an $\alpha$ such that the limit in (16.27) is less than $\epsilon$ and $\mu[f = \alpha] = 0$. But then the integral on the left is less than $\epsilon$ for all $n$ exceeding some $n_0$. Since the $f_n$ are individually integrable, uniform integrability follows (increase $\alpha$). ■


Corollary. Suppose that $\mu(\Omega) < \infty$. If $f$ and the $f_n$ are integrable, and if $f_n \to f$ almost everywhere, then these conditions are equivalent:

(i) $f_n$ are uniformly integrable;
(ii) $\int |f - f_n|\,d\mu \to 0$;
(iii) $\int |f_n|\,d\mu \to \int |f|\,d\mu$.

PROOF. If (i) holds, then the differences $|f - f_n|$ are uniformly integrable, and since they converge to 0 almost everywhere, (ii) follows by the theorem. And (ii) implies (iii) because $\bigl||f| - |f_n|\bigr| \le |f - f_n|$. Finally, it follows from the theorem that (iii) implies (i). ■


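Suppose that

(16.28)  $\sup_n \int |f_n|^{1+\epsilon}\,d\mu < \infty$

for a positive $\epsilon$. If $K$ is the supremum here, then

$\int_{[|f_n| \ge \alpha]} |f_n|\,d\mu \le \frac{1}{\alpha^\epsilon} \int_{[|f_n| \ge \alpha]} |f_n|^{1+\epsilon}\,d\mu \le \frac{K}{\alpha^\epsilon},$

and so $\{f_n\}$ is uniformly integrable.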


Complex Functions 


A complex-valued function on $\Omega$ has the form $f(\omega) = g(\omega) + ih(\omega)$, where $g$ and $h$ are ordinary finite-valued real functions on $\Omega$. It is, by definition, measurable $\mathcal{F}$ if $g$ and $h$ are. If $g$ and $h$ are integrable, then $f$ is by definition integrable, and its integral is of course taken as

(16.29)  $\int (g + ih)\,d\mu = \int g\,d\mu + i\int h\,d\mu.$




Since $\max\{|g|, |h|\} \le |f| \le |g| + |h|$, $f$ is integrable if and only if $\int |f|\,d\mu < \infty$, just as in the real case.

The linearity equation (16.3) extends to complex functions and coefficients; the proof requires only that everything be decomposed into real and imaginary parts. Consider the inequality (16.4) for the complex case:

(16.30)  $\left| \int f\,d\mu \right| \le \int |f|\,d\mu.$

If $f = g + ih$ and $g$ and $h$ are simple, the corresponding partitions can be taken to be the same ($g = \sum_k x_k I_{A_k}$ and $h = \sum_k y_k I_{A_k}$), and (16.30) follows by the triangle inequality. For the general integrable $f$, represent $g$ and $h$ as limits of simple functions dominated by $|f|$, and pass to the limit.

The results on integration to the limit extend as well. Suppose that $f_k = g_k + ih_k$ are complex functions satisfying $\sum_k \int |f_k|\,d\mu < \infty$. Then $\sum_k \int |g_k|\,d\mu < \infty$, and so by the corollary to Theorem 16.7, $\sum_k g_k$ is integrable and integrates to $\sum_k \int g_k\,d\mu$. The same is true of the imaginary parts, and hence $\sum_k f_k$ is integrable and

(16.31)  $\int \sum_k f_k\,d\mu = \sum_k \int f_k\,d\mu.$
k k 


PROBLEMS 


16.1. 13.9↑ Suppose that $\mu(\Omega) < \infty$ and $f_n$ are uniformly bounded.
(a) Assume $f_n \to f$ uniformly and deduce $\int f_n\,d\mu \to \int f\,d\mu$ from (16.5).
(b) Use part (a) and Egoroff's theorem to give another proof of Theorem 16.5.

16.2. Prove that if $0 \le f_n \to f$ almost everywhere and $\int f_n\,d\mu \le A < \infty$, then $f$ is integrable and $\int f\,d\mu \le A$. (This is essentially the same as Fatou's lemma and is sometimes called by that name.)

16.3. Suppose that $f_n$ are integrable and $\sup_n \int f_n\,d\mu < \infty$. Show that, if $f_n \uparrow f$, then $f$ is integrable and $\int f_n\,d\mu \to \int f\,d\mu$. This is Beppo Levi's theorem.

16.4. (a) Suppose that functions $a_n, b_n, f_n$ converge almost everywhere to functions $a, b, f$, respectively. Suppose that the first two sequences may be integrated to the limit; that is, the functions are all integrable and $\int a_n\,d\mu \to \int a\,d\mu$, $\int b_n\,d\mu \to \int b\,d\mu$. Suppose, finally, that the first two sequences enclose the third: $a_n \le f_n \le b_n$ almost everywhere. Show that the third may be integrated to the limit.
(b) Deduce Lebesgue's dominated convergence theorem from part (a).


16.5. About Theorem 16.8: 


(a) Part (i) is local: there can be a different set $A$ for each $t_0$. Part (ii) can be recast as a local theorem. Suppose that for $\omega \in A$, where $A$ satisfies (16.9), $f(\omega, t)$ has derivative $f'(\omega, t_0)$ at $t_0$; suppose further that

(16.32)  $\left| \frac{f(\omega, t_0 + h) - f(\omega, t_0)}{h} \right| \le g_1(\omega)$

for $\omega \in A$ and $0 < |h| < \delta$, where $\delta$ is independent of $\omega$ and $g_1$ is integrable. Then $\phi'(t_0) = \int f'(\omega, t_0)\,\mu(d\omega)$.

The natural way to check (16.32), however, is by the mean-value theorem, and this requires (for $\omega \in A$) a derivative throughout a neighborhood of $t_0$.
(b) If $\mu$ is Lebesgue measure on the unit interval $\Omega$, $(a, b) = (0, 1)$, and $f(\omega, t) = I_{(0,t)}(\omega)$, then part (i) applies but part (ii) does not. Why? What about (16.32)?

16.6. Suppose that $f(\omega, \cdot)$ is, for each $\omega$, a function on an open set $W$ in the complex plane and that $f(\cdot, z)$ is for $z$ in $W$ measurable $\mathcal{F}$ and integrable. Suppose that $A$ satisfies (16.9), that $f(\omega, \cdot)$ is analytic in $W$ for $\omega$ in $A$, and that for each $z_0$ in $W$ there is an integrable $g(\cdot, z_0)$ such that $|f'(\omega, z)| \le g(\omega, z_0)$ for all $\omega \in A$ and all $z$ in some neighborhood of $z_0$. Show that $\int f(\omega, z)\,\mu(d\omega)$ is analytic in $W$.

16.7. (a) Show that if $|f_n| \le g$ and $g$ is integrable, then $\{f_n\}$ is uniformly integrable. Compare the hypotheses of Theorems 16.4 and 16.14.
(b) On the unit interval with Lebesgue measure, let $f_n = (n/\log n) I_{(0, n^{-1})}$ for $n \ge 3$. Show that the $f_n$ are uniformly integrable (and $\int f_n\,d\mu \to 0$) although they are not dominated by any integrable $g$.
(c) Show for $f_n = nI_{(0, n^{-1})} - nI_{(n^{-1}, 2n^{-1})}$ that the $f_n$ can be integrated to the limit (Lebesgue measure) even though the $f_n$ are not uniformly integrable.

16.8. Show that if $f$ is integrable, then for each $\epsilon$ there is a $\delta$ such that $\mu(A) < \delta$ implies $\int_A |f|\,d\mu < \epsilon$.

16.9. ↑ Suppose that $\mu(\Omega) < \infty$. Show that $\{f_n\}$ is uniformly integrable if and only if (i) $\int |f_n|\,d\mu$ is bounded and (ii) for each $\epsilon$ there is a $\delta$ such that $\mu(A) < \delta$ implies $\int_A |f_n|\,d\mu < \epsilon$ for all $n$.

16.10. 2.19 16.9↑ Assume $\mu(\Omega) < \infty$.
(a) Show by examples that neither of the conditions (i) and (ii) in the preceding problem implies the other.
(b) Show that (ii) implies (i) for all sequences $\{f_n\}$ if and only if $\mu$ is nonatomic.

16.11. Let $f$ be a complex-valued function integrating to $re^{i\theta}$, $r \ge 0$. From $\int (|f(\omega)| - e^{-i\theta}f(\omega))\,\mu(d\omega) = \int |f|\,d\mu - r$, deduce (16.30).

16.12. 11.5↑ Consider the vector lattice $\mathcal{L}$ and the functional $\Lambda$ of Problems 11.4 and 11.5. Let $\mu$ be the extension (Theorem 11.3) to $\mathcal{F} = \sigma(\mathcal{F}_0)$ of the set function $\mu_0$ on $\mathcal{F}_0$.
(a) Show by (11.7) that for positive $x, y_1, y_2$ one has $\nu([f > 1] \times (0, x]) = x\mu_0[f > 1] = x\mu[f > 1]$ and $\nu([y_1 < f \le y_2] \times (0, x]) = x\mu[y_1 < f \le y_2]$.
(b) Show that if $f \in \mathcal{L}$, then $f$ is integrable and

$\Lambda(f) = \int f\,d\mu.$

(Consider first the case $f \ge 0$.) This is the Daniell-Stone representation theorem.


SECTION 17. THE INTEGRAL WITH RESPECT TO LEBESGUE 
MEASURE 


The Lebesgue Integral on the Line 


A real measurable function on the line is Lebesgue integrable if it is integrable with respect to Lebesgue measure $\lambda$, and its Lebesgue integral $\int f\,d\lambda$ is denoted by $\int f(x)\,dx$, or, in the case of integration over an interval, by $\int_a^b f(x)\,dx$. The theory of the preceding two sections of course applies to the Lebesgue integral. It is instructive to compare it with the Riemann integral.


The Riemann Integral 


A real function $f$ on an interval $(a, b]$ is by definition† Riemann integrable, with integral $r$, if this condition holds: For each $\epsilon$ there exists a $\delta$ with the property that

(17.1)  $\left| r - \sum_i f(x_i)\lambda(I_i) \right| < \epsilon$

if $\{I_i\}$ is any finite partition of $(a, b]$ into subintervals satisfying $\lambda(I_i) < \delta$ and if $x_i \in I_i$ for each $i$. The Riemann integral for step functions was used in Section 1.

†For other definitions, see the first problem at the end of the section and the note on terminology following it.

Suppose that $f$ is Borel measurable, and suppose that $f$ is bounded, so that it is Lebesgue integrable. If $f$ is also Riemann integrable, the $r$ of (17.1) must coincide with the Lebesgue integral $\int_a^b f(x)\,dx$. To see this, first note that letting $x_i$ vary over $I_i$ leads from (17.1) to

(17.2)  $\left| r - \sum_i \sup_{x \in I_i} f(x) \cdot \lambda(I_i) \right| \le \epsilon.$

Consider the simple function $g$ with value $\sup_{x \in I_i} f(x)$ on $I_i$. Now $f \le g$, and the sum in (17.2) is the Lebesgue integral of $g$. By monotonicity of the Lebesgue integral, $\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx \le r + \epsilon$. The reverse inequality follows in the same way, and so $\int_a^b f(x)\,dx = r$. Therefore, the Riemann integral when it exists coincides with the Lebesgue integral.

Suppose that $f$ is continuous on $[a, b]$. By uniform continuity, for each $\epsilon$ there exists a $\delta$ such that $|f(x) - f(y)| < \epsilon/(b - a)$ if $|x - y| < \delta$. If $\lambda(I_i) < \delta$ and $x_i \in I_i$, then $g = \sum_i f(x_i)I_{I_i}$ satisfies $|f - g| < \epsilon/(b - a)$ and hence $|\int_a^b f\,dx - \int_a^b g\,dx| < \epsilon$. But this is (17.1) with $r$ replaced (as it must be) by the Lebesgue integral $\int_a^b f\,dx$: A continuous function on a closed interval is Riemann integrable.
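
The uniform-continuity argument can be watched numerically. The sketch below assumes, purely for illustration, $f = \cos$ on $(0, 1]$, equal subintervals, and tags $x_i$ drawn at random from each subinterval; since $|\cos x - \cos y| \le |x - y|$, a mesh of $1/n$ forces the sum in (17.1) to within $1/n$ of the integral $\sin 1$.

```python
import math
import random


def riemann_sum(f, a, b, n, rng):
    """Sum of f(x_i) * lambda(I_i) over n equal subintervals of (a, b]."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        left = a + i * width
        x_i = left + rng.random() * width  # arbitrary tag in the i-th piece
        total += f(x_i) * width
    return total


rng = random.Random(0)      # fixed seed so that runs are repeatable
exact = math.sin(1.0)       # integral of cos over (0, 1]
errors = [abs(riemann_sum(math.cos, 0.0, 1.0, n, rng) - exact)
          for n in (10, 100, 1000)]
print(errors)
```

The bound holds whatever tags are chosen, which is the point of requiring (17.1) for every choice of $x_i \in I_i$.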


Example 17.1. If $f$ is the indicator of the set of rationals in $(0, 1]$, then the Lebesgue integral $\int_0^1 f(x)\,dx$ is 0 because $f = 0$ almost everywhere. But for an arbitrary partition $\{I_i\}$ of $(0, 1]$ into intervals, $\sum_i f(x_i)\lambda(I_i)$ with $x_i \in I_i$ is 1 if each $x_i$ is taken from the rationals and 0 if each $x_i$ is taken from the irrationals. Thus $f$ is not Riemann integrable. ■


Example 17.2. For the $f$ of Example 17.1, there exists a $g$ (namely, $g = 0$) such that $f = g$ almost everywhere and $g$ is Riemann integrable. To show that the Lebesgue theory is not reducible to the Riemann theory by the casting out of sets of measure 0, it is of interest to produce an $f$ (bounded and Borel measurable) for which no such $g$ exists.

In Examples 3.1 and 3.2 there were constructed Borel subsets $A$ of $(0, 1]$ such that $0 < \lambda(A) < 1$ and such that $\lambda(A \cap I) > 0$ for each subinterval $I$ of $(0, 1]$. Take $f = I_A$. Suppose that $f = g$ almost everywhere and that $\{I_i\}$ is a decomposition of $(0, 1]$ into subintervals. Since $\lambda(I_i \cap A \cap [f = g]) = \lambda(I_i \cap A) > 0$, it follows that $g(y_i) = f(y_i) = 1$ for some $y_i$ in $I_i \cap A$, and therefore,

(17.3)  $\sum_i g(y_i)\lambda(I_i) = 1 > \lambda(A).$

If $g$ were Riemann integrable, its Riemann integral would coincide with the Lebesgue integrals $\int g\,dx = \int f\,dx = \lambda(A)$, in contradiction to (17.3). ■


It is because of their extreme oscillations that the functions in Examples 17.1 and 17.2 fail to be Riemann integrable. (It can be shown that a bounded function on a bounded interval is Riemann integrable if and only if the set of its discontinuities has Lebesgue measure 0.†) This cannot happen in the case of the Lebesgue integral of a measurable function: if $f$ fails to be Lebesgue integrable, it is because its positive part or its negative part is too large, not because one or the other is too irregular.

†See Problem 17.1.


Example 17.3. It is an important analytic fact that

(17.4)  $\lim_{t \to \infty} \int_0^t \frac{\sin x}{x}\,dx = \frac{\pi}{2}.$

The existence of the limit is simple to prove, because $\int_{(n-1)\pi}^{n\pi} x^{-1}\sin x\,dx$ alternates in sign and its absolute value decreases to 0; the value of the limit will be identified in the next section (Example 18.4). On the other hand, $x^{-1}\sin x$ is not Lebesgue integrable over $(0, \infty)$, because its positive and negative parts integrate to $\infty$. Within the conventions of the Lebesgue theory, (17.4) thus cannot be written $\int_0^\infty x^{-1}\sin x\,dx = \pi/2$, although such "improper" integrals appear in calculus texts. It is, of course, just a question of choosing the terminology most convenient for the subject at hand. ■
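
The contrast between the convergent limit in (17.4) and the divergence of the integral of $|x^{-1}\sin x|$ can be sketched numerically. The code below is an illustration only: it uses a composite Simpson rule on each interval $[k\pi, (k+1)\pi]$ (a choice made here, not in the text), and the helper names are hypothetical.

```python
import math


def simpson(g, a, b, n=200):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * g(a + i * h)
    return s * h / 3


def sinc(x):
    """sin(x)/x, extended by its limit 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def signed_integral(periods):
    """Integral of sin(x)/x over (0, periods * pi)."""
    return sum(simpson(sinc, k * math.pi, (k + 1) * math.pi)
               for k in range(periods))


def absolute_integral(periods):
    """Integral of |sin(x)/x| over (0, periods * pi)."""
    return sum(simpson(lambda x: abs(sinc(x)), k * math.pi, (k + 1) * math.pi)
               for k in range(periods))


print(signed_integral(200), math.pi / 2)
print(absolute_integral(20), absolute_integral(200))
```

The signed integrals settle near $\pi/2$, while the absolute integrals keep growing, roughly like a multiple of the logarithm of the number of half-periods; this is the sense in which the positive and negative parts each integrate to $\infty$.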


The function in Example 17.2 is not equal almost everywhere to any Riemann integrable function. Every Lebesgue integrable function can, however, be approximated in a certain sense by Riemann integrable functions of two kinds.


Theorem 17.1. Suppose that $\int |f|\,dx < \infty$ and $\epsilon > 0$.

(i) There is a step function $g = \sum_{i=1}^k x_i I_{A_i}$, with bounded intervals as the $A_i$, such that $\int |f - g|\,dx < \epsilon$.

(ii) There is a continuous integrable $h$ with bounded support such that $\int |f - h|\,dx < \epsilon$.


PROOF. By the construction (13.6) and the dominated convergence theorem, (i) holds if the $A_i$ are not required to be intervals; moreover, $\lambda(A_i) < \infty$ for each $i$ for which $x_i \ne 0$. By Theorem 11.4 there exists a finite disjoint union $B_i$ of intervals such that $\lambda(A_i \,\triangle\, B_i) < \epsilon/(k|x_i|)$. But then $\sum_i x_i I_{B_i}$ satisfies the requirements of (i) with $2\epsilon$ in place of $\epsilon$.

To prove (ii) it is only necessary to show that for the $g$ of (i) there is a continuous $h$ such that $\int |g - h|\,dx < \epsilon$. Suppose that $A_i = (a_i, b_i]$; let $h_i(x)$ be 1 on $(a_i, b_i]$ and 0 outside $(a_i - \delta, b_i + \delta]$, and let it increase linearly from 0 to 1 over $(a_i - \delta, a_i]$ and decrease linearly from 1 to 0 over $(b_i, b_i + \delta]$. Since $\int |I_{A_i} - h_i|\,dx \to 0$ as $\delta \to 0$, $h = \sum_i x_i h_i$ for sufficiently small $\delta$ will satisfy the requirements. ■


The Lebesgue integral is thus determined by its values for continuous functions.†

†This provides another way of defining the Lebesgue integral on the line. See Problem 17.13.


The Fundamental Theorem of Calculus 


Adopt the convention that $\int_\alpha^\beta = -\int_\beta^\alpha$ if $\alpha > \beta$. For positive $h$,

$\left| \frac{1}{h}\int_x^{x+h} f(y)\,dy - f(x) \right| \le \frac{1}{h}\int_x^{x+h} |f(y) - f(x)|\,dy \le \sup[|f(y) - f(x)|\colon x \le y \le x + h],$

and the right side goes to 0 with $h$ if $f$ is continuous at $x$. The same thing holds for negative $h$, and therefore $\int_a^x f(y)\,dy$ has derivative $f(x)$:

(17.5)  $\frac{d}{dx}\int_a^x f(y)\,dy = f(x)$

if $f$ is continuous at $x$.

Suppose that $F$ is a function with continuous derivative $F' = f$; that is, suppose that $F$ is a primitive of the continuous function $f$. Then

(17.6)  $\int_a^b f(x)\,dx = F(b) - F(a),$

as follows from the fact that $F(x) - F(a)$ and $\int_a^x f(y)\,dy$ agree at $x = a$ and by (17.5) have identical derivatives. For continuous $f$, (17.5) and (17.6) are two ways of stating the fundamental theorem of calculus. To the calculation of Lebesgue integrals the methods of elementary calculus thus apply.

As will follow from the general theory of derivatives in Section 31, (17.5) holds outside a set of Lebesgue measure 0 if $f$ is integrable; it need not be continuous. As the following example shows, however, (17.6) can fail for discontinuous $f$.


Example 17.4. Define $F(x) = x^2 \sin x^{-2}$ for $0 < x \le \frac{1}{2}$ and $F(x) = 0$ for $x \le 0$ and for $x \ge 1$. Now for $\frac{1}{2} < x < 1$ define $F(x)$ in such a way that $F$ is continuously differentiable over $(0, \infty)$. Then $F$ is everywhere differentiable, but $F'(0) = 0$ and $F'(x) = 2x \sin x^{-2} - 2x^{-1} \cos x^{-2}$ for $0 < x < \frac{1}{2}$. Thus $F'$ is discontinuous at 0; $F'$ is, in fact, not even integrable over $(0, 1]$, which makes (17.6) impossible for $a = 0$.

For a more extreme example, decompose $(0, 1]$ into countably many subintervals $(a_n, b_n]$. Define $G(x) = 0$ for $x \le 0$ and $x \ge 1$, and on $(a_n, b_n]$ define $G(x) = F((x - a_n)/(b_n - a_n))$. Then $G$ is everywhere differentiable, but (17.6) is impossible for $G$ if $(a, b]$ contains any of the $(a_n, b_n]$, because $G'$ is not integrable over any of them. ■


Change of Variable 


For

(17.7)  $[a, b] \xrightarrow{\;T\;} [u, v] \xrightarrow{\;f\;} R^1,$

the change-of-variable formula is

(17.8)  $\int_a^b f(Tx)T'(x)\,dx = \int_{Ta}^{Tb} f(y)\,dy.$

If $T'$ exists and is continuous, and if $f$ is continuous, the two integrals are finite because the integrands are bounded, and to prove (17.8) it is enough to let $b$ be a variable and differentiate with respect to it.†


With the obvious limiting arguments, this applies to unbounded intervals and to open ones:

Example 17.5. Put $T(x) = \tan x$ on $(-\pi/2, \pi/2)$. Then $T'(x) = 1 + T^2(x)$, and (17.8) for $f(y) = (1 + y^2)^{-1}$ gives

(17.9)  $\int_{-\infty}^{\infty} \frac{dy}{1 + y^2} = \pi.$ ■
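
Before passing to the limit, (17.8) can be checked on a bounded interval inside $(-\pi/2, \pi/2)$. The sketch below is an illustration under assumptions made here: the endpoints $\pm 1.5$ and the midpoint rule are arbitrary choices, and the helper names are hypothetical.

```python
import math


def midpoint(g, a, b, n=20000):
    """Midpoint-rule approximation to the integral of g over (a, b)."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def f(y):
    return 1.0 / (1.0 + y * y)


def T(x):
    return math.tan(x)


def T_prime(x):
    return 1.0 + math.tan(x) ** 2


a, b = -1.5, 1.5
# Left side of (17.8): the integrand f(Tx) T'(x) is identically 1 here.
lhs = midpoint(lambda x: f(T(x)) * T_prime(x), a, b)
# Right side of (17.8): integrate f over (Ta, Tb).
rhs = midpoint(f, T(a), T(b))
print(lhs, rhs)
```

Both sides are close to $b - a = 3$; letting the endpoints tend to $\pm\pi/2$ sends $Ta$ and $Tb$ to $\mp\infty$ and $\pm\infty$ and gives the value $\pi$ of (17.9).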


The Lebesgue Integral in R* 


The $k$-dimensional Lebesgue integral, the integral in $(R^k, \mathcal{R}^k, \lambda_k)$, is denoted $\int f(x)\,dx$, it being understood that $x = (x_1, \ldots, x_k)$. In low-dimensional cases it is also denoted $\iint_A f(x_1, x_2)\,dx_1\,dx_2$, and so on.

As for the rule for changing variables, suppose that $T\colon U \to R^k$, where $U$ is an open set in $R^k$. The map has the form $Tx = (t_1(x), \ldots, t_k(x))$; it is by definition continuously differentiable if the partial derivatives $t_{ij}(x) = \partial t_i/\partial x_j$ exist and are continuous in $U$. Let $D_x = [t_{ij}(x)]$ be the Jacobian matrix, let $J(x) = \det D_x$ be the Jacobian determinant, and let $V = TU$.


Theorem 17.2. Let $T$ be a continuously differentiable map of the open set $U$ onto $V$. Suppose that $T$ is one-to-one and that $J(x) \ne 0$ for all $x$. If $f$ is nonnegative, then

(17.10)  $\int_U f(Tx)|J(x)|\,dx = \int_{TU} f(y)\,dy.$

By the inverse-function theorem [A35], $V$ is open and the inverse point mapping $T^{-1}$ is continuously differentiable. It is assumed in (17.10) that $f\colon V \to R^1$ is a Borel function. As usual, for the general $f$, (17.10) holds with $|f|$ in place of $f$, and if the two sides are finite, the absolute-value bars can be removed; and of course $f$ can be replaced by $fI_B$ or $fI_{TA}$.

†See Problem 17.11 for extensions.


Example 17.6. Suppose that $T$ is a nonsingular linear transformation on $U = V = R^k$. Then $D_x$ is for each $x$ the matrix of the transformation. If $T$ is identified with this matrix, then (17.10) becomes

(17.11)  $|\det T| \int_{R^k} f(Tx)\,dx = \int_{R^k} f(y)\,dy.$

If $f = I_{TA}$, this holds because of (12.2), and then it follows in the usual sequence for simple $f$ and for the general nonnegative $f$: Theorem 17.2 is easy in the linear case. ■


Example 17.7. In $R^2$ take $U = [(\rho, \theta)\colon \rho > 0,\ 0 < \theta < 2\pi]$ and $T(\rho, \theta) = (\rho\cos\theta, \rho\sin\theta)$. The Jacobian is $J(\rho, \theta) = \rho$, and (17.10) gives the formula for integrating in polar coordinates:

(17.12)  $\iint_{\rho > 0,\ 0 < \theta < 2\pi} f(\rho\cos\theta, \rho\sin\theta)\,\rho\,d\rho\,d\theta = \iint_{R^2} f(x, y)\,dx\,dy.$

Here $V$ is $R^2$ with the ray $[(x, 0)\colon x \ge 0]$ removed; (17.12) obviously holds even though the ray is included on the right. If the constraint on $\theta$ is replaced by $0 < \theta < 4\pi$, for example, then (17.12) is false (a factor of 2 is needed on the right). This explains the assumption that $T$ is one-to-one. ■
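
The factor of 2 can be seen in a computation. The sketch below assumes, for illustration only, $f(x, y) = e^{-(x^2 + y^2)}$, whose integral over $R^2$ is $\pi$, and approximates the left side of (17.12) by a midpoint rule with $\rho$ truncated at 8; the function names and grid sizes are choices made here.

```python
import math


def polar_integral(f, theta_max, rho_max=8.0, n_rho=400, n_theta=200):
    """Midpoint rule for the integral of f(rho cos t, rho sin t) * rho
    over 0 < rho < rho_max, 0 < t < theta_max."""
    d_rho = rho_max / n_rho
    d_theta = theta_max / n_theta
    total = 0.0
    for i in range(n_rho):
        rho = (i + 0.5) * d_rho
        for j in range(n_theta):
            theta = (j + 0.5) * d_theta
            x, y = rho * math.cos(theta), rho * math.sin(theta)
            total += f(x, y) * rho  # the Jacobian J(rho, theta) = rho
    return total * d_rho * d_theta


def gaussian(x, y):
    return math.exp(-(x * x + y * y))


once = polar_integral(gaussian, 2 * math.pi)   # T one-to-one
twice = polar_integral(gaussian, 4 * math.pi)  # each point covered twice
print(once, twice, math.pi)
```

With $0 < \theta < 2\pi$ the result is close to $\pi$; with $0 < \theta < 4\pi$ it is close to $2\pi$, since $T$ then covers each point of $V$ twice.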


Theorem 17.2 is not the strongest possible; it is only necessary to assume that $T$ is one-to-one on the set $U_0 = [x \in U\colon J(x) \ne 0]$. This is because, by Sard's theorem,† the image $T(U - U_0)$ of the set where $J$ vanishes has Lebesgue measure 0.

†SPIVAK, p. 72.

PROOF OF THEOREM 17.2. Suppose it is shown that

(17.13)  $\int_U f(Tx)|J(x)|\,dx \ge \int_V f(y)\,dy$

for nonnegative $f$. Apply this to $T^{-1}\colon V \to U$, which [A35] is continuously differentiable and has Jacobian $J^{-1}(T^{-1}y)$ at $y$:

$\int_V g(T^{-1}y)|J^{-1}(T^{-1}y)|\,dy \ge \int_U g(x)\,dx$

for nonnegative $g$ on $U$. Taking $g(x) = f(Tx)|J(x)|$ here leads back to (17.13), but with the inequality reversed. Therefore, proving (17.13) will be enough.

For $f = I_{TA}$, (17.13) reduces to

(17.14)  $\int_A |J(x)|\,dx \ge \lambda_k(TA).$


Each side of (17.14) is a measure on $\mathcal{U} = U \cap \mathcal{R}^k$. If $\mathcal{A}$ consists of the rectangles $A$ satisfying $A^- \subset U$, then $\mathcal{A}$ is a semiring generating $\mathcal{U}$, $U$ is a countable union of $\mathcal{A}$-sets, and the left side of (17.14) is finite for $A$ in $\mathcal{A}$ ($\sup_{A^-}|J| < \infty$). It follows by Corollary 2 to Theorem 11.4 that if (17.14) holds for $A$ in $\mathcal{A}$, then it holds for $A$ in $\mathcal{U}$. But then (linearity and monotone convergence) (17.13) will follow.

Proof of (17.14) for $A$ in $\mathcal{A}$. Split the given rectangle $A$ into finitely many subrectangles $Q_i$ satisfying

(17.15)  $\operatorname{diam} Q_i < \delta,$

$\delta$ to be determined. Let $x_i$ be some point of $Q_i$. Given $\epsilon$, choose $\delta$ in the first place so that $|J(x) - J(x')| < \epsilon$ if $x, x' \in A^-$ and $|x - x'| < \delta$. Then (17.15) implies

(17.16)  $\sum_i |J(x_i)|\lambda_k(Q_i) \le \int_A |J(x)|\,dx + \epsilon\lambda_k(A).$

Let $Q_i^{+\epsilon}$ be a rectangle that is concentric with $Q_i$ and similar to it and whose edge lengths are those of $Q_i$ multiplied by $1 + \epsilon$. For $x$ in $U$ consider the affine transformation

(17.17)  $\phi_x z = D_x(z - x) + Tx, \quad z \in R^k;$

$\phi_x z$ will [A34] be a good approximation to $Tz$ for $z$ near $x$. Suppose, as will be proved in a moment, that for each $\epsilon$ there is a $\delta$ such that, if (17.15) holds, then, for each $i$, $\phi_{x_i}$ approximates $T$ so well on $Q_i$ that

(17.18)  $TQ_i \subset \phi_{x_i}Q_i^{+\epsilon}.$

By Theorem 12.2, which shows in the nonsingular case how an affine transformation changes the Lebesgue measures of sets, $\lambda_k(\phi_{x_i}Q_i^{+\epsilon}) = |J(x_i)|\lambda_k(Q_i^{+\epsilon})$. If (17.18) holds, then

(17.19)  $\lambda_k(TA) = \sum_i \lambda_k(TQ_i) \le \sum_i \lambda_k(\phi_{x_i}Q_i^{+\epsilon}) = \sum_i |J(x_i)|\lambda_k(Q_i^{+\epsilon}) = (1 + \epsilon)^k \sum_i |J(x_i)|\lambda_k(Q_i).$

(This, the central step in the proof, shows where the Jacobian in (17.10) comes from.) If for each $\epsilon$ there is a $\delta$ such that (17.15) implies both (17.16) and (17.19), then (17.14) will follow. Thus everything depends on (17.18), and the remaining problem is to show that for each $\epsilon$ there is a $\delta$ such that (17.18) holds if (17.15) does.

Proof of (17.18). As $(x, z)$ varies over the compact set $A^- \times [z\colon |z| = 1]$, $|D_x^{-1}z|$ is continuous, and therefore, for some $c$,

(17.20)  $|D_x^{-1}z| \le c|z|$ for $x \in A$, $z \in R^k$.

Since the $t_{jl}$ are uniformly continuous on $A^-$, $\delta$ can be chosen so that $|t_{jl}(z) - t_{jl}(x)| \le \epsilon/k^2c$ for all $j, l$ if $z, x \in A$ and $|z - x| < \delta$. But then, by linear approximation [A34: (16)], $|Tz - Tx - D_x(z - x)| \le \epsilon c^{-1}|z - x| < \epsilon c^{-1}\delta$. If (17.15) holds and $\delta < 1$, then by the definition (17.17),

(17.21)  $|Tz - \phi_{x_i}z| < \epsilon/c$ for $z \in Q_i$.

To prove (17.18), note that $z \in Q_i$ implies

$|\phi_{x_i}^{-1}Tz - z| = |\phi_{x_i}^{-1}Tz - \phi_{x_i}^{-1}\phi_{x_i}z| = |D_{x_i}^{-1}(Tz - \phi_{x_i}z)| \le c|Tz - \phi_{x_i}z| < \epsilon,$

where the first inequality follows by (17.20) and the second by (17.21). Since $\phi_{x_i}^{-1}Tz$ is within $\epsilon$ of the point $z$ of $Q_i$, it lies in $Q_i^{+\epsilon}$: $\phi_{x_i}^{-1}Tz \in Q_i^{+\epsilon}$, or $Tz \in \phi_{x_i}Q_i^{+\epsilon}$. Hence (17.18) holds, which completes the proof. ■


Stieltjes Integrals 


Suppose that $F$ is a function on $R^k$ satisfying the hypotheses of Theorem 12.5, so that there exists a measure $\mu$ such that $\mu(A) = \Delta_A F$ for bounded rectangles $A$. In integrals with respect to $\mu$, $\mu(dx)$ is often replaced by $dF(x)$:

(17.22)  $\int_A f(x)\,dF(x) = \int_A f(x)\,\mu(dx).$

The left side of this equation is the Stieltjes integral of $f$ with respect to $F$; since it is defined by the right side of the equation, nothing new is involved.

Suppose that $f$ is uniformly continuous on a rectangle $A$, and suppose that $A$ is decomposed into rectangles $A_m$ small enough that $|f(x) - f(y)| < \epsilon/\mu(A)$ for $x, y \in A_m$. Then

$\left| \int_A f(x)\,dF(x) - \sum_m f(x_m)\Delta_{A_m}F \right| < \epsilon$

for $x_m \in A_m$. In this case the left side of (17.22) can be defined as the limit of these approximating sums without any reference to the general theory of measure, and for historical reasons it is sometimes called the Riemann-Stieltjes integral; (17.22) for the general $f$ is then called the Lebesgue-Stieltjes integral. Since these distinctions are unimportant in the context of general measure theory, $\int f(x)\,dF(x)$ and $\int f\,dF$ are best regarded as merely notational variants for $\int f(x)\,\mu(dx)$ and $\int f\,d\mu$.
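
In one dimension the approximating sums are easy to compute, since $\Delta_{A_m}F$ for $A_m = (s, t]$ is $F(t) - F(s)$. The sketch below is an illustration only; it assumes $F(x) = x^2$ and $f(x) = x$ on $(0, 1]$, for which $\mu$ has density $2x$ and the integral is $\int_0^1 2x^2\,dx = 2/3$, and it takes each $x_m$ to be a right endpoint.

```python
def stieltjes_sum(f, F, a, b, n):
    """Sum of f(x_m) * (F(t) - F(s)) over n equal pieces (s, t] of (a, b]."""
    width = (b - a) / n
    total = 0.0
    for m in range(n):
        left = a + m * width
        right = a + (m + 1) * width
        total += f(right) * (F(right) - F(left))  # tag x_m = right endpoint
    return total


approx = stieltjes_sum(lambda x: x, lambda x: x * x, 0.0, 1.0, 1000)
print(approx, 2.0 / 3.0)
```

Here $f$ varies by at most $10^{-3}$ on each piece and $\mu(0, 1] = 1$, so the displayed bound puts the sum within $10^{-3}$ of $2/3$.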


PROBLEMS 


Let $f$ be a bounded function on a bounded interval, say $[0, 1]$. Do not assume that $f$ is a Borel function. Denote by $L_*f$ and $L^*f$ ($L$ for Lebesgue) the lower and upper integrals as defined by (15.9) and (15.10), where $\mu$ is now Lebesgue measure $\lambda$ on the Borel sets of $[0, 1]$. Denote by $R_*f$ and $R^*f$ ($R$ for Riemann) the same quantities but with the outer supremum and infimum in (15.9) and (15.10) extending only over finite partitions of $[0, 1]$ into subintervals. It is obvious (see (15.11)) that

(17.23)  $R_*f \le L_*f \le L^*f \le R^*f.$

Suppose that $f$ is bounded, and consider these three conditions:

(i) There is an $r$ with the property that for each $\epsilon$ there is a $\delta$ such that (17.1) holds if $\{I_i\}$ partitions $[0, 1]$ into subintervals with $\lambda(I_i) < \delta$ and if $x_i \in I_i$.
(ii) $R_*f = R^*f$.
(iii) If $D_f$ is the set of points of discontinuity of $f$, then $\lambda(D_f) = 0$.

The conditions are equivalent.

17.1. Prove:
(a) $D_f$ is a Borel set.
(b) (i) implies (ii).
(c) (ii) implies (iii).
(d) (iii) implies (i).
(e) The $r$ of (i) must coincide with the $R_*f = R^*f$ of (ii).

A note on terminology. An $f$ on the general $(\Omega, \mathcal{F}, \mu)$ is defined to be integrable not if (15.12) holds, but if (16.1) does. And an $f$ on $[0, 1]$ is defined to be integrable with respect to Lebesgue measure not if $L_*f = L^*f$ holds, but, rather, if

(17.24)  $\int_0^1 |f(x)|\,dx < \infty$

does. The condition $L_*f = L^*f$ is not at issue, since for bounded $f$ it always holds if $f$ is a Borel function, and in this book $f$ is always assumed to be a Borel function unless the contrary is explicitly stated. For the Lebesgue integral, the question is whether $f$ is small enough that (17.24) holds, not whether it is sufficiently regular that $L_*f = L^*f$. For the Riemann integral, the terminology is different because $R_*f < R^*f$ holds for all sorts of important Borel functions, and one way to define Riemann integrability is to require $R_*f = R^*f$. In the context of general integration theory, one occasionally looks at the Riemann integral, but mostly for illustration and comparison.


17.2. 3.15 17.1↑ (a) Show that an indicator $I_A$ for $A \subset [0, 1]$ is Riemann integrable if and only if $A$ is Jordan measurable.
(b) Find a Riemann integrable function that is not a Borel function.

17.3. Extend Theorem 17.1 to $R^k$.

17.4. Show that if $f$ is integrable, then

$\lim_{t \to 0} \int |f(x + t) - f(x)|\,dx = 0.$

Use Theorem 17.1.

17.5. Suppose that $\mu$ is a finite measure on $\mathcal{R}^k$ and $A$ is closed. Show that $\mu(x + A)$ is upper semicontinuous in $x$ and hence measurable.

17.6. Suppose that $\int_0^\infty |f(x)|\,dx < \infty$. Show that for each $\epsilon$, $\lambda[x\colon x > \alpha,\ |f(x)| > \epsilon] \to 0$ as $\alpha \to \infty$. Show by example that $f(x)$ need not go to 0 as $x \to \infty$ (even if $f$ is continuous).


17.7. Let $f_n(x) = x^{n-1} - 2x^{2n-1}$. Calculate and compare $\int_0^1 \sum_{n=1}^\infty f_n(x)\,dx$ and $\sum_{n=1}^\infty \int_0^1 f_n(x)\,dx$. Relate this to Theorem 16.6 and to the corollary to Theorem 16.7.

17.8. Show that $(1 + y^2)^{-1}$ has equal integrals over $(-\infty, -1)$, $(-1, 0)$, $(0, 1)$, $(1, \infty)$. Conclude from (17.9) that $\int_0^1 (1 + y^2)^{-1}\,dy = \pi/4$. Expand the integrand in a geometric series and deduce Leibniz's formula

$\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$

by Theorem 16.7 (note that its corollary does not apply).


17.9. Show that if $f$ is integrable, there exist continuous, integrable functions $g_n$ such that $g_n(x) \to f(x)$ except on a set of Lebesgue measure 0. (Use Theorem 17.1(ii) with $\epsilon = n^{-2}$.)

17.10. 13.9 17.9↑ Let $f$ be a finite-valued Borel function over $[0, 1]$. By the following steps, prove Lusin's theorem: For each $\epsilon$ there exists a continuous function $g$ such that $\lambda[x \in (0, 1)\colon f(x) \ne g(x)] < \epsilon$.
(a) Show that $f$ may be assumed integrable, or even bounded.
(b) Let $g_n$ be continuous functions converging to $f$ almost everywhere. Combine Egoroff's theorem and Theorem 12.3 to show that convergence is uniform on a compact set $K$ such that $\lambda((0, 1) - K) < \epsilon$. The limit $\lim_n g_n(x) = f(x)$ must be continuous when restricted to $K$.
(c) Exhibit $(0, 1) - K$ as a disjoint union of open intervals $I_k$ [A12]; define $g$ as $f$ on $K$, and define it by linear interpolation on each $I_k$.

17.11. Suppose in (17.7) that $T'$ exists and is continuous and $f$ is a Borel function, and suppose that $\int_a^b |f(Tx)T'(x)|\,dx < \infty$. Show in steps that $\int_{T(a,b)} |f(y)|\,dy < \infty$ and (17.8) holds. Prove this for (a) $f$ continuous, (b) $f = I_{[s, \infty)}$, (c) $f = I_B$, (d) $f$ simple, (e) $f \ge 0$, (f) $f$ general.

17.12. 16.12↑ Let $\mathcal{L}$ consist of the continuous functions on $R^1$ with compact support. Show that $\mathcal{L}$ is a vector lattice in the sense of Problem 11.4 and has the property that $f \in \mathcal{L}$ implies $f \wedge 1 \in \mathcal{L}$ (note that $1 \notin \mathcal{L}$). Show that the $\sigma$-field $\mathcal{F}$ generated by $\mathcal{L}$ is $\mathcal{R}^1$. Suppose $\Lambda$ is a positive linear functional on $\mathcal{L}$; show that $\Lambda$ has the required continuity property if and only if $f_n(x) \downarrow 0$ uniformly in $x$ implies $\Lambda(f_n) \to 0$. Show under this assumption on $\Lambda$ that there is a measure $\mu$ on $\mathcal{R}^1$ such that

(17.25)  $\Lambda(f) = \int f\,d\mu, \quad f \in \mathcal{L}.$

Show that $\mu$ is $\sigma$-finite and unique. This is a version of the Riesz representation theorem.

17.13. ↑ Let $\Lambda(f)$ be the Riemann integral of $f$, which does exist for $f$ in $\mathcal{L}$. Using the most elementary facts about Riemann integration, show that the $\mu$ determined by (17.25) is Lebesgue measure. This gives still another way of constructing Lebesgue measure.

17.14. ↑ Extend the ideas in the preceding two problems to $R^k$.


SECTION 18. PRODUCT MEASURE AND FUBINI’S THEOREM 


Let $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ be measurable spaces. For given measures $\mu$ and $\nu$ on these spaces, the problem is to construct on the Cartesian product $X \times Y$ a product measure $\pi$ such that $\pi(A \times B) = \mu(A)\nu(B)$ for $A \subset X$ and $B \subset Y$. In the case where $\mu$ and $\nu$ are Lebesgue measure on the line, $\pi$ will be Lebesgue measure in the plane. The main result is Fubini's theorem, according to which double integrals can be calculated as iterated integrals.


Product Spaces 


It is notationally convenient in this section to change from $(\Omega, \mathcal{F})$ to $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$. In the product space $X \times Y$ a measurable rectangle is a product $A \times B$ for which $A \in \mathcal{X}$ and $B \in \mathcal{Y}$. The natural class of sets in $X \times Y$ to consider is the $\sigma$-field $\mathcal{X} \times \mathcal{Y}$ generated by the measurable rectangles. (Of course, $\mathcal{X} \times \mathcal{Y}$ is not a Cartesian product in the usual sense.)


Example 18.1. Suppose that $X = Y = R^1$ and $\mathcal{X} = \mathcal{Y} = \mathcal{R}^1$. Then a measurable rectangle is a Cartesian product $A \times B$ in which $A$ and $B$ are linear Borel sets. The term rectangle has up to this point been reserved for Cartesian products of intervals, and so a measurable rectangle is more general. As the measurable rectangles do include the ordinary ones and the latter generate $\mathcal{R}^2$, it follows that $\mathcal{R}^2 \subset \mathcal{R}^1 \times \mathcal{R}^1$. On the other hand, if $A$ is an interval, $[B\colon A \times B \in \mathcal{R}^2]$ contains $R^1$ ($A \times R^1 = \bigcup_n (A \times (-n, n]) \in \mathcal{R}^2$) and is closed under the formation of proper differences and countable unions; thus it is a $\sigma$-field containing the intervals and hence the Borel sets. Therefore, if $B$ is a Borel set, $[A\colon A \times B \in \mathcal{R}^2]$ contains the intervals and hence, being a $\sigma$-field, contains the Borel sets. Thus all the measurable rectangles are in $\mathcal{R}^2$, and so $\mathcal{R}^1 \times \mathcal{R}^1 = \mathcal{R}^2$ consists exactly of the two-dimensional Borel sets. ■


As this example shows, $\mathscr{X} \times \mathscr{Y}$ is in general much larger than the class of
measurable rectangles.


Theorem 18.1. (i) If $E \in \mathscr{X} \times \mathscr{Y}$, then for each $x$ the set $[y\colon (x, y) \in E]$
lies in $\mathscr{Y}$ and for each $y$ the set $[x\colon (x, y) \in E]$ lies in $\mathscr{X}$.

(ii) If $f$ is measurable $\mathscr{X} \times \mathscr{Y}$, then for each fixed $x$ the function $f(x, \cdot)$ is
measurable $\mathscr{Y}$, and for each fixed $y$ the function $f(\cdot, y)$ is measurable $\mathscr{X}$.


The set $[y\colon (x, y) \in E]$ is the section of $E$ determined by $x$, and $f(x, \cdot)$ is
the section of $f$ determined by $x$.


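Proof. Fix $x$, and consider the mapping $T_x\colon Y \to X \times Y$ defined by
$T_x y = (x, y)$. If $E = A \times B$ is a measurable rectangle, $T_x^{-1}E$ is $B$ or $\emptyset$
according as $A$ contains $x$ or not, and in either case $T_x^{-1}E \in \mathscr{Y}$. By Theorem
13.1(i), $T_x$ is measurable $\mathscr{Y}/\mathscr{X} \times \mathscr{Y}$. Hence $[y\colon (x, y) \in E] = T_x^{-1}E \in \mathscr{Y}$ for
$E \in \mathscr{X} \times \mathscr{Y}$. By Theorem 13.1(ii), if $f$ is measurable $\mathscr{X} \times \mathscr{Y}/\mathscr{R}^1$, then $fT_x$ is
measurable $\mathscr{Y}/\mathscr{R}^1$. Hence $f(x, \cdot) = fT_x(\cdot)$ is measurable $\mathscr{Y}$. The symmetric
statements for fixed $y$ are proved the same way. ∎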


Product Measure 


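Now suppose that $(X, \mathscr{X}, \mu)$ and $(Y, \mathscr{Y}, \nu)$ are measure spaces, and suppose
for the moment that $\mu$ and $\nu$ are finite. By the theorem just proved, $\nu[y\colon (x, y) \in E]$
is a well-defined function of $x$. If $\mathscr{L}$ is the class of $E$ in $\mathscr{X} \times \mathscr{Y}$
for which this function is measurable $\mathscr{X}$, it is not hard to show that $\mathscr{L}$ is a
$\lambda$-system. Since the function is $I_A(x)\nu(B)$ for $E = A \times B$, $\mathscr{L}$ contains the
$\pi$-system consisting of the measurable rectangles. Hence $\mathscr{L}$ coincides with
$\mathscr{X} \times \mathscr{Y}$ by the $\pi$-$\lambda$ theorem. It follows without difficulty that

(18.1)  $$\pi'(E) = \int_X \nu[y\colon (x, y) \in E]\,\mu(dx), \qquad E \in \mathscr{X} \times \mathscr{Y},$$

is a finite measure on $\mathscr{X} \times \mathscr{Y}$, and similarly for

(18.2)  $$\pi''(E) = \int_Y \mu[x\colon (x, y) \in E]\,\nu(dy), \qquad E \in \mathscr{X} \times \mathscr{Y}.$$

For measurable rectangles,

(18.3)  $$\pi'(A \times B) = \pi''(A \times B) = \mu(A) \cdot \nu(B).$$

The class of $E$ in $\mathscr{X} \times \mathscr{Y}$ for which $\pi'(E) = \pi''(E)$ thus contains the
measurable rectangles; since this class is a $\lambda$-system, it contains $\mathscr{X} \times \mathscr{Y}$. The
common value $\pi'(E) = \pi''(E)$ is the product measure sought.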

To show that (18.1) and (18.2) also agree for $\sigma$-finite $\mu$ and $\nu$, let $\{A_m\}$ and
$\{B_n\}$ be decompositions of $X$ and $Y$ into sets of finite measure, and put
$\mu_m(A) = \mu(A \cap A_m)$ and $\nu_n(B) = \nu(B \cap B_n)$. Since $\nu(B) = \sum_n \nu_n(B)$, the
integrand in (18.1) is measurable $\mathscr{X}$ in the $\sigma$-finite as well as in the finite case;
hence $\pi'$ is a well-defined measure on $\mathscr{X} \times \mathscr{Y}$ and so is $\pi''$. If $\pi'_{mn}$ and $\pi''_{mn}$
are (18.1) and (18.2) for $\mu_m$ and $\nu_n$, then by the finite case, already treated
(see Example 16.5), $\pi'(E) = \sum_{mn} \pi'_{mn}(E) = \sum_{mn} \pi''_{mn}(E) = \pi''(E)$. Thus (18.1)
and (18.2) coincide in the $\sigma$-finite case as well. Moreover, $\pi'(A \times B) =
\sum_{mn} \mu_m(A)\nu_n(B) = \mu(A)\nu(B)$.


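Theorem 18.2. If $(X, \mathscr{X}, \mu)$ and $(Y, \mathscr{Y}, \nu)$ are $\sigma$-finite measure spaces, then
$\pi(E) = \pi'(E) = \pi''(E)$ defines a $\sigma$-finite measure on $\mathscr{X} \times \mathscr{Y}$; it is the only
measure such that $\pi(A \times B) = \mu(A) \cdot \nu(B)$ for measurable rectangles.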


Proof. Only $\sigma$-finiteness and uniqueness remain to be proved. The
products $A_m \times B_n$ for $\{A_m\}$ and $\{B_n\}$ as above decompose $X \times Y$ into
measurable rectangles of finite $\pi$-measure. This proves both $\sigma$-finiteness and
uniqueness, since the measurable rectangles form a $\pi$-system generating
$\mathscr{X} \times \mathscr{Y}$ (Theorem 10.3). ∎


The $\pi$ thus defined is called product measure; it is usually denoted $\mu \times \nu$.
Note that the integrands in (18.1) and (18.2) may be infinite for certain $x$ and
$y$, which is one reason for introducing functions with infinite values. Note
also that (18.3) in some cases requires the conventions (15.2).
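The agreement of (18.1) and (18.2) can be checked by hand on finite spaces, where every integral is a finite sum. The following is a minimal sketch under that assumption; the point masses `mu` and `nu`, the sets `E` and `rect`, and the function names are hypothetical choices made for illustration, not part of the text.

```python
from fractions import Fraction

# Hypothetical finite measure spaces: each measure is given by its point masses.
mu = {"a": Fraction(1, 2), "b": Fraction(3, 2)}            # measure on X
nu = {0: Fraction(2), 1: Fraction(1, 3), 2: Fraction(1)}   # measure on Y


def measure(m, points):
    """Measure of a finite set of points under the point-mass measure m."""
    return sum((m[p] for p in points), Fraction(0))


def pi_prime(E):
    """(18.1): integrate nu of the x-section of E against mu."""
    return sum(
        (mu[x] * measure(nu, [y for y in nu if (x, y) in E]) for x in mu),
        Fraction(0),
    )


def pi_double_prime(E):
    """(18.2): integrate mu of the y-section of E against nu."""
    return sum(
        (nu[y] * measure(mu, [x for x in mu if (x, y) in E]) for y in nu),
        Fraction(0),
    )


# A set that is not a rectangle, and a measurable rectangle {b} x {0, 1}.
E = {("a", 0), ("a", 2), ("b", 1)}
rect = {(x, y) for x in ["b"] for y in [0, 1]}

print(pi_prime(E), pi_double_prime(E))
print(pi_prime(rect), measure(mu, ["b"]) * measure(nu, [0, 1]))
```

On a finite space the two formulas are just the two orders of summation of the same double sum, which is why exact rational arithmetic suffices here.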


Fubini’s Theorem 


Integrals with respect to $\pi$ are usually computed via the formulas

(18.4)  $$\int_{X \times Y} f(x, y)\,\pi(d(x, y)) = \int_X \left[ \int_Y f(x, y)\,\nu(dy) \right] \mu(dx)$$

and

(18.5)  $$\int_{X \times Y} f(x, y)\,\pi(d(x, y)) = \int_Y \left[ \int_X f(x, y)\,\mu(dx) \right] \nu(dy).$$

The left side here is a double integral, and the right sides are iterated
integrals. The formulas hold very generally, as the following argument shows.
Consider (18.4). The inner integral on the right is

(18.6)  $$\int_Y f(x, y)\,\nu(dy).$$


Because of Theorem 18.1(ii), for $f$ measurable $\mathscr{X} \times \mathscr{Y}$ the integrand here is
measurable $\mathscr{Y}$; the question is whether the integral exists, whether (18.6) is
measurable $\mathscr{X}$ as a function of $x$, and whether it integrates to the left side
of (18.4).

First consider nonnegative $f$. If $f = I_E$, everything follows from Theorem
18.2: (18.6) is $\nu[y\colon (x, y) \in E]$, and (18.4) reduces to $\pi(E) = \pi'(E)$. Because
of linearity (Theorem 15.1(iv)), if $f$ is a nonnegative simple function, then
(18.6) is a linear combination of functions measurable $\mathscr{X}$ and hence is itself
measurable $\mathscr{X}$; further application of linearity to the two sides of (18.4)
shows that (18.4) again holds. The general nonnegative $f$ is the monotone
limit of nonnegative simple functions; applying the monotone convergence
theorem to (18.6) and then to each side of (18.4) shows that again $f$ has the
properties required.

Thus for nonnegative $f$, (18.6) is a well-defined function of $x$ (the value $\infty$
is not excluded), measurable $\mathscr{X}$, whose integral satisfies (18.4). If one side of
(18.4) is infinite, so is the other; if both are finite, they have the same finite
value.

Now suppose that $f$, not necessarily nonnegative, is integrable with
respect to $\pi$. Then the two sides of (18.4) are finite if $f$ is replaced by $|f|$.
Now make the further assumption that

(18.7)  $$\int_Y |f(x, y)|\,\nu(dy) < \infty$$

for all $x$. Then

(18.8)  $$\int_Y f(x, y)\,\nu(dy) = \int_Y f^+(x, y)\,\nu(dy) - \int_Y f^-(x, y)\,\nu(dy).$$

The functions on the right here are measurable $\mathscr{X}$ and (since $f^+, f^- \le |f|$)
integrable with respect to $\mu$, and so the same is true of the function on the
left. Integrating out the $x$ and applying (18.4) to $f^+$ and to $f^-$ gives (18.4)
for $f$ itself.

The set $A_0$ of $x$ satisfying (18.7) need not coincide with $X$, but
$\mu(X - A_0) = 0$ if $f$ is integrable with respect to $\pi$, because the function in
(18.7) integrates to $\int |f|\,d\pi$ (Theorem 15.2(iii)). Now (18.8) holds on $A_0$,
(18.6) is measurable $\mathscr{X}$ on $A_0$, and (18.4) again follows if the inner integral
on the right is given some arbitrary constant value on $X - A_0$.

The same analysis applies to (18.5):


Theorem 18.3. Under the hypotheses of Theorem 18.2, for nonnegative $f$
the functions

(18.9)  $$\int_Y f(x, y)\,\nu(dy), \qquad \int_X f(x, y)\,\mu(dx)$$

are measurable $\mathscr{X}$ and $\mathscr{Y}$, respectively, and (18.4) and (18.5) hold. If $f$ (not
necessarily nonnegative) is integrable with respect to $\pi$, then the two functions
(18.9) are finite and measurable on $A_0$ and on $B_0$, respectively, where
$\mu(X - A_0) = \nu(Y - B_0) = 0$, and again (18.4) and (18.5) hold.


It is understood here that the inner integrals on the right in (18.4) and
(18.5) are set equal to 0 (say) outside $A_0$ and $B_0$.†

This is Fubini's theorem; the part concerning nonnegative $f$ is sometimes
called Tonelli's theorem. Application of the theorem usually follows a two-step
procedure that parallels its proof. First, one of the iterated integrals is
computed (or estimated above) with $|f|$ in place of $f$. If the result is finite,
then the double integral (integral with respect to $\pi$) of $|f|$ must be finite, so
that $f$ is integrable with respect to $\pi$; then the value of the double integral of
$f$ is found by computing one of the iterated integrals of $f$. If the iterated
integral of $|f|$ is infinite, $f$ is not integrable $\pi$.

†Since two functions that are equal almost everywhere have the same integral, the theory of
integration could be extended to functions that are only defined almost everywhere; then $A_0$ and
$B_0$ would disappear from Theorem 18.3.


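Example 18.2. Let $D_r$ be the closed disk in the plane with center at the
origin and radius $r$. By (17.12),

$$\lambda_2(D_r) = \iint_{D_r} dx\,dy = \iint_{\substack{0 < \rho < r \\ 0 < \theta < 2\pi}} \rho\,d\rho\,d\theta.$$

The last integral can be evaluated by Fubini's theorem:

$$\lambda_2(D_r) = 2\pi \int_0^r \rho\,d\rho = \pi r^2.$$ ∎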


Example 18.3. Let $I = \int_{-\infty}^{\infty} e^{-x^2}\,dx$. By Fubini's theorem applied in the
plane and by the polar-coordinate formula,

$$I^2 = \iint_{R^2} e^{-(x^2 + y^2)}\,dx\,dy = \iint_{\substack{\rho > 0 \\ 0 < \theta < 2\pi}} e^{-\rho^2} \rho\,d\rho\,d\theta.$$

The double integral on the right can be evaluated as an iterated integral by
another application of Fubini's theorem, which leads to the famous formula

(18.10)  $$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$

As the integrand in this example is nonnegative, the question of integrability
does not arise. ∎
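A numerical sanity check of (18.10) is easy to sketch. The code below approximates $I$ by a midpoint rule on a truncated range and compares $I^2$ with the iterated polar integral $2\pi \int_0^R e^{-\rho^2}\rho\,d\rho$; the truncation radius `R`, the step count, and the helper `midpoint` are assumptions made for the illustration, and the check is approximate rather than a proof.

```python
import math


def midpoint(f, a, b, n):
    """Composite midpoint rule for the integral of f over [a, b] with n cells."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


R = 8.0        # truncation radius; exp(-R**2) is negligible at this size
N = 20000      # number of cells (an arbitrary but ample choice)

# I = integral of exp(-x^2) over the line, truncated to [-R, R].
I = midpoint(lambda x: math.exp(-x * x), -R, R, N)

# Iterated polar integral: the theta integration contributes the factor 2*pi.
polar = 2 * math.pi * midpoint(lambda r: math.exp(-r * r) * r, 0.0, R, N)

print(I, math.sqrt(math.pi))
print(I * I, polar)
```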


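Example 18.4. It is possible by means of Fubini's theorem to identify the
limit in (17.4). First,

$$\int_0^t e^{-ux} \sin x\,dx = \frac{1}{1 + u^2}\left[ 1 - e^{-ut}(u \sin t + \cos t) \right],$$

as follows by differentiation with respect to $t$. Since

$$\int_0^t \left[ \int_0^\infty |e^{-ux} \sin x|\,du \right] dx = \int_0^t |\sin x| \cdot x^{-1}\,dx \le t < \infty,$$

Fubini's theorem applies to the integration of $e^{-ux} \sin x$ over $(0, t) \times (0, \infty)$:

$$\int_0^t \frac{\sin x}{x}\,dx = \int_0^t \sin x \left[ \int_0^\infty e^{-ux}\,du \right] dx
= \int_0^\infty \left[ \int_0^t e^{-ux} \sin x\,dx \right] du$$
$$= \int_0^\infty \frac{du}{1 + u^2} - \int_0^\infty \frac{e^{-ut}}{1 + u^2}(u \sin t + \cos t)\,du.$$

The next-to-last integral is $\pi/2$ (see (17.9)), and a change of variable $ut = s$
shows that the final integral goes to 0 as $t \to \infty$. Therefore,

(18.11)  $$\lim_{t \to \infty} \int_0^t \frac{\sin x}{x}\,dx = \frac{\pi}{2}.$$ ∎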


Integration by Parts 


Let $F$ and $G$ be two nondecreasing, right-continuous functions on an interval
$[a, b]$, and let $\mu$ and $\nu$ be the corresponding measures:

$$\mu(x, y] = F(y) - F(x), \qquad \nu(x, y] = G(y) - G(x), \qquad a \le x \le y \le b.$$

In accordance with the convention (17.22) write $dF(x)$ and $dG(x)$ in place of
$\mu(dx)$ and $\nu(dx)$.


Theorem 18.4. If $F$ and $G$ have no common points of discontinuity in
$(a, b]$, then

(18.12)  $$\int_{(a, b]} G(x)\,dF(x) = F(b)G(b) - F(a)G(a) - \int_{(a, b]} F(x)\,dG(x).$$

In brief: $\int G\,dF = \Delta FG - \int F\,dG$. This is one version of the partial integration
formula.


Proof. Note first that replacing $F(x)$ by $F(x) - C$ leaves (18.12)
unchanged: it merely adds and subtracts $C\nu(a, b]$ on the right. Hence (take
$C = F(a)$) it is no restriction to assume that $F(x) = \mu(a, x]$ and no restriction
to assume that $G(x) = \nu(a, x]$. If $\pi = \mu \times \nu$ is product measure in the plane,
then by Fubini's theorem,

(18.13)  $$\pi[(x, y)\colon a < y \le x \le b] = \int_{(a, b]} \nu(a, x]\,\mu(dx) = \int_{(a, b]} G(x)\,dF(x)$$

and

(18.14)  $$\pi[(x, y)\colon a < x \le y \le b] = \int_{(a, b]} \mu(a, y]\,\nu(dy) = \int_{(a, b]} F(y)\,dG(y).$$

The two sets on the left have as their union the square $S = (a, b] \times (a, b]$.
The diagonal of $S$ has $\pi$-measure

$$\pi[(x, y)\colon a < x = y \le b] = \int_{(a, b]} \nu\{x\}\,\mu(dx) = 0$$

because of the assumption that $\mu$ and $\nu$ share no points of positive measure.
Thus the left sides of (18.13) and (18.14) add to $\pi(S) = \mu(a, b]\,\nu(a, b] =
F(b)G(b)$. ∎


Suppose that $\nu$ has a density $g$ with respect to Lebesgue measure and let
$G(x) = c + \int_a^x g(t)\,dt$. Transform the right side of (18.12) by the formula
(16.13) for integration with respect to a density; the result is

(18.15)  $$\int_{(a, b]} G(x)\,dF(x) = F(b)G(b) - F(a)G(a) - \int_a^b F(x)g(x)\,dx.$$

A consideration of positive and negative parts shows that this holds for any $g$
integrable over $(a, b]$.

Suppose further that $\mu$ has a density $f$ with respect to Lebesgue measure,
and let $F(x) = c' + \int_a^x f(t)\,dt$. Then (18.15) further reduces to

(18.16)  $$\int_a^b G(x)f(x)\,dx = F(b)G(b) - F(a)G(a) - \int_a^b F(x)g(x)\,dx.$$

Again, $f$ can be any integrable function. This is the classical formula for
integration by parts.

Under the appropriate integrability conditions, $(a, b]$ can be replaced by
an unbounded interval.
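For purely atomic $\mu$ and $\nu$ the integrals in (18.12) are finite sums, so the formula and the role of the diagonal in its proof can be checked exactly. The sketch below assumes finitely many atoms in $(a, b]$; the atom locations, the masses, and the helper names (`make_step`, `parts_gap`) are hypothetical and serve only as an illustration.

```python
from fractions import Fraction as Fr

a, b = 0, 10


def make_step(atoms, start):
    """Right-continuous step function x -> start + (mass of atoms in (a, x])."""
    return lambda x: start + sum((m for t, m in atoms.items() if a < t <= x), Fr(0))


def parts_gap(mu_atoms, nu_atoms, Fa=Fr(0), Ga=Fr(0)):
    """Left side of (18.12) minus its right side, for atomic measures."""
    F = make_step(mu_atoms, Fa)
    G = make_step(nu_atoms, Ga)
    lhs = sum((G(x) * m for x, m in mu_atoms.items()), Fr(0))     # integral of G dF
    int_F_dG = sum((F(y) * m for y, m in nu_atoms.items()), Fr(0))  # integral of F dG
    rhs = F(b) * G(b) - F(a) * G(a) - int_F_dG
    return lhs - rhs


# No common atoms: (18.12) holds exactly, whatever F(a) and G(a) are.
mu_atoms = {1: Fr(1, 2), 4: Fr(2), 7: Fr(1, 3)}
nu_atoms = {2: Fr(3), 5: Fr(1, 4), 10: Fr(1)}
print(parts_gap(mu_atoms, nu_atoms, Fa=Fr(5), Ga=Fr(-2)))

# A shared atom at 3: the diagonal has positive product measure, and the two
# sides differ by exactly that amount.
print(parts_gap({3: Fr(1)}, {3: Fr(1)}))
```

In the second call the gap equals $\mu\{3\}\nu\{3\}$, the product measure of the diagonal, which is the term the hypothesis of Theorem 18.4 forces to vanish.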




Products of Higher Order 


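Suppose that $(X, \mathscr{X}, \mu)$, $(Y, \mathscr{Y}, \nu)$, and $(Z, \mathscr{Z}, \eta)$ are three $\sigma$-finite measure
spaces. In the usual way, identify the products $X \times Y \times Z$ and $(X \times Y) \times Z$.
Let $\mathscr{X} \times \mathscr{Y} \times \mathscr{Z}$ be the $\sigma$-field in $X \times Y \times Z$ generated by the $A \times B \times C$
with $A, B, C$ in $\mathscr{X}, \mathscr{Y}, \mathscr{Z}$, respectively. For $C$ in $\mathscr{Z}$, let $\mathscr{G}_C$ be the class of
$E \in \mathscr{X} \times \mathscr{Y}$ for which $E \times C \in \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}$. Then $\mathscr{G}_C$ is a $\sigma$-field containing
the measurable rectangles in $X \times Y$, and so $\mathscr{G}_C = \mathscr{X} \times \mathscr{Y}$. Therefore,
$(\mathscr{X} \times \mathscr{Y}) \times \mathscr{Z} \subset \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}$. But the reverse relation is obvious, and so
$(\mathscr{X} \times \mathscr{Y}) \times \mathscr{Z} = \mathscr{X} \times \mathscr{Y} \times \mathscr{Z}$.

Define the product $\mu \times \nu \times \eta$ on $\mathscr{X} \times \mathscr{Y} \times \mathscr{Z}$ as $(\mu \times \nu) \times \eta$. It gives to
$A \times B \times C$ the value $(\mu \times \nu)(A \times B) \cdot \eta(C) = \mu(A)\nu(B)\eta(C)$, and it is the
only measure that does. The formulas (18.4) and (18.5) extend in the obvious
way.

Products of four or more components can clearly be treated in the same
way. This leads in particular to another construction of Lebesgue measure in
$R^k = R^1 \times \cdots \times R^1$ (see Example 18.1) as the product $\lambda \times \cdots \times \lambda$ ($k$ factors)
on $\mathscr{R}^k = \mathscr{R}^1 \times \cdots \times \mathscr{R}^1$. Fubini's theorem of course gives a practical
way to calculate volumes: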


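Example 18.5. Let $V_k$ be the volume of the sphere of radius 1 in $R^k$; by
Theorem 12.2, a sphere in $R^k$ with radius $r$ has volume $r^k V_k$. Let $A$ be the
unit sphere in $R^k$, let $B = [(x_1, x_2)\colon x_1^2 + x_2^2 \le 1]$, and let
$C(x_1, x_2) = [(x_3, \ldots, x_k)\colon \sum_{i=3}^k x_i^2 \le 1 - x_1^2 - x_2^2]$. By Fubini's theorem,

$$V_k = \int_A dx_1 \cdots dx_k = \int_B dx_1\,dx_2 \int_{C(x_1, x_2)} dx_3 \cdots dx_k$$
$$= \int_B V_{k-2}\,(1 - x_1^2 - x_2^2)^{(k-2)/2}\,dx_1\,dx_2$$
$$= V_{k-2} \iint_{\substack{0 < \theta < 2\pi \\ 0 < \rho < 1}} (1 - \rho^2)^{(k-2)/2} \rho\,d\rho\,d\theta$$
$$= \pi V_{k-2} \int_0^1 t^{(k-2)/2}\,dt = \frac{2\pi V_{k-2}}{k}.$$

If $V_0$ is taken as 1, this holds for $k = 2$ as well as for $k \ge 3$. Since $V_1 = 2$, it
follows by induction that

$$V_{2i-1} = \frac{2(2\pi)^{i-1}}{1 \times 3 \times 5 \times \cdots \times (2i - 1)}, \qquad V_{2i} = \frac{(2\pi)^i}{2 \times 4 \times \cdots \times (2i)}$$

for $i = 1, 2, \ldots$. Example 18.2 is a special case. ∎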
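The recursion $V_k = 2\pi V_{k-2}/k$ with $V_0 = 1$ and $V_1 = 2$ is simple to evaluate. As a sketch, the function below (the name `unit_ball_volume` is an illustrative choice) computes $V_k$ this way and compares it with the closed form $\pi^{k/2}/\Gamma(k/2 + 1)$ that appears as (18.19) in Problem 18.18; `math.gamma` is the standard-library gamma function.

```python
import math


def unit_ball_volume(k):
    """Volume of the unit sphere in R^k via V_k = 2*pi*V_{k-2}/k, V_0 = 1, V_1 = 2."""
    if k == 0:
        return 1.0
    if k == 1:
        return 2.0
    return 2.0 * math.pi * unit_ball_volume(k - 2) / k


for k in range(0, 8):
    closed_form = math.pi ** (k / 2) / math.gamma(k / 2 + 1)
    print(k, unit_ball_volume(k), closed_form)
```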




PROBLEMS 


18.1. Show by Theorem 18.1 that if $A \times B$ is nonempty and lies in $\mathscr{X} \times \mathscr{Y}$, then
$A \in \mathscr{X}$ and $B \in \mathscr{Y}$.


18.2. 2.9↑ Suppose that $X = Y$ is uncountable and $\mathscr{X} = \mathscr{Y}$ consists of the countable
and the cocountable sets. Show that the diagonal $E = [(x, y)\colon x = y]$ does
not lie in $\mathscr{X} \times \mathscr{Y}$, even though $[y\colon (x, y) \in E] \in \mathscr{Y}$ and $[x\colon (x, y) \in E] \in \mathscr{X}$
for all $x$ and $y$.


18.3. 10.5 18.1↑ Let $(X, \mathscr{X}, \mu) = (Y, \mathscr{Y}, \nu)$ be the completion of $(R^1, \mathscr{R}^1, \lambda)$.
Show that $(X \times Y, \mathscr{X} \times \mathscr{Y}, \mu \times \nu)$ is not complete.


18.4. The assumption of $\sigma$-finiteness in Theorem 18.2 is essential: Let $\mu$ be Lebesgue
measure on the line, let $\nu$ be counting measure on the line, and take
$E = [(x, y)\colon x = y]$. Then (18.1) and (18.2) do not agree.


18.5. Example 18.2 in effect connects $\pi$ as the area of the unit disk $D_1$ with the $\pi$
of trigonometry.
(a) A second way: Calculate $\lambda_2(D_1)$ directly by Fubini's theorem: $\lambda_2(D_1) =
\int_{-1}^{1} 2(1 - x^2)^{1/2}\,dx$. Evaluate the integral by trigonometric substitution.

(b) A third way: Inscribe in the unit circle a regular polygon of $n$ sides. Its
interior consists of $n$ congruent isosceles triangles with angle $2\pi/n$ at the
apex; the area is $n \sin(\pi/n)\cos(\pi/n)$, which goes to $\pi$.


18.6. Suppose that $f$ is nonnegative on a $\sigma$-finite measure space $(\Omega, \mathscr{F}, \mu)$. Show
that

$$\int f\,d\mu = (\mu \times \lambda)\left[ (\omega, y) \in \Omega \times R^1\colon 0 \le y \le f(\omega) \right].$$

Prove that the set on the right is measurable. This gives the "area under the
curve." Given the existence of $\mu \times \lambda$ on $\Omega \times R^1$, one can use the right side of
this equation as an alternative definition of the integral.


18.7. Reconsider Problem 12.12.


18.8. Suppose that $\nu[y\colon (x, y) \in E] = \nu[y\colon (x, y) \in F]$ for all $x$, and show that
$(\mu \times \nu)(E) = (\mu \times \nu)(F)$. This is a general version of Cavalieri's principle.


18.9. (a) Suppose that $\mu$ is $\sigma$-finite, and prove the corollary to Theorem 16.7 by
Fubini's theorem in the product of $(\Omega, \mathscr{F}, \mu)$ and $\{1, 2, \ldots\}$ with counting
measure.

(b) Relate the series in Problem 17.7 to Fubini's theorem.


18.10. (a) Let $\mu = \nu$ be counting measure on $X = Y = \{1, 2, \ldots\}$. If

$$f(x, y) = \begin{cases} 2 - 2^{-x} & \text{if } x = y, \\ -2 + 2^{-x} & \text{if } x = y + 1, \\ 0 & \text{otherwise,} \end{cases}$$

then the iterated integrals exist but are unequal. Why does this not contradict
Fubini's theorem?
(b) Show that $xy/(x^2 + y^2)^2$ is not integrable over the square $[(x, y)\colon |x|, |y| \le 1]$
even though the iterated integrals exist and are equal.


18.11. Exhibit a case in which (18.12) fails because $F$ and $G$ have a common point of
discontinuity.


18.12. Prove (18.16) for the case in which all the functions are continuous by
differentiating with respect to the upper limit of integration.


18.13. Prove for distribution functions $F$ that $\int_{-\infty}^{\infty} (F(x + c) - F(x))\,dx = c$.


18.14. Prove for continuous distribution functions that $\int_{-\infty}^{\infty} F(x)\,dF(x) = \frac{1}{2}$.


18.15. Suppose that a number $f_n$ is defined for each $n \ge n_0$ and put $F(x) =
\sum_{n_0 \le n \le x} f_n$. Deduce from (18.15) that

(18.17)  $$\sum_{n_0 \le n \le x} G(n) f_n = F(x)G(x) - \int_{n_0}^{x} F(t)g(t)\,dt$$

if $G(y) - G(x) = \int_x^y g(t)\,dt$, which will hold if $G$ has continuous derivative $g$.
First assume that the $f_n$ are nonnegative.


18.16. ↑ Take $n_0 = 1$, $f_n = 1$, and $G(x) = 1/x$, and derive $\sum_{n \le x} n^{-1} = \log x + \gamma +
O(1/x)$, where $\gamma = 1 - \int_1^\infty (t - \lfloor t \rfloor)t^{-2}\,dt$ is Euler's constant.


18.17. 5.20 18.15↑ Use (18.17) and (5.51) to prove that there exists a constant $c$
such that

(18.18)  $$\sum_{p \le x} \frac{1}{p} = \log\log x + c + O\!\left( \frac{1}{\log x} \right).$$


18.18. Euler's gamma function is defined for positive $t$ by $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$.

(a) Prove that $\Gamma^{(k)}(t) = \int_0^\infty x^{t-1} (\log x)^k e^{-x}\,dx$.

(b) Show by partial integration that $\Gamma(t + 1) = t\Gamma(t)$ and hence that $\Gamma(n + 1)
= n!$ for integral $n$.

(c) From (18.10) deduce $\Gamma(\frac{1}{2}) = \sqrt{\pi}$.

(d) Show that the unit sphere in $R^k$ has volume (see Example 18.5)

(18.19)  $$V_k = \frac{\pi^{k/2}}{\Gamma(k/2 + 1)}.$$



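18.19. By partial integration prove that $\int_0^\infty ((\sin x)/x)^2\,dx = \pi/2$ and
$\int_{-\infty}^{\infty} (1 - \cos x)x^{-2}\,dx = \pi$.


18.20. Suppose that $\mu$ is a probability measure on $(X, \mathscr{X})$ and that, for each $x$ in $X$,
$\nu_x$ is a probability measure on $(Y, \mathscr{Y})$. Suppose further that, for each $B$ in $\mathscr{Y}$,
$\nu_x(B)$ is, as a function of $x$, measurable $\mathscr{X}$. Regard the $\mu(A)$ as initial
probabilities and the $\nu_x(B)$ as transition probabilities.

(a) Show that, if $E \in \mathscr{X} \times \mathscr{Y}$, then $\nu_x[y\colon (x, y) \in E]$ is measurable $\mathscr{X}$.

(b) Show that $\pi(E) = \int_X \nu_x[y\colon (x, y) \in E]\,\mu(dx)$ defines a probability measure
on $\mathscr{X} \times \mathscr{Y}$. If $\nu_x = \nu$ does not depend on $x$, this is just (18.1).

(c) Show that if $f$ is measurable $\mathscr{X} \times \mathscr{Y}$ and nonnegative, then $\int_Y f(x, y)\,\nu_x(dy)$
is measurable $\mathscr{X}$. Show further that

$$\int_{X \times Y} f(x, y)\,\pi(d(x, y)) = \int_X \left[ \int_Y f(x, y)\,\nu_x(dy) \right] \mu(dx),$$

which extends Fubini's theorem (in the probability case). Consider also $f$'s that
may be negative.

(d) Let $\nu(B) = \int_X \nu_x(B)\,\mu(dx)$. Show that $\pi(X \times B) = \nu(B)$ and

$$\int_Y f(y)\,\nu(dy) = \int_X \left[ \int_Y f(y)\,\nu_x(dy) \right] \mu(dx).$$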

SECTION 19. THE $L^p$ SPACES*


Definitions


Fix a measure space $(\Omega, \mathscr{F}, \mu)$. For $1 \le p < \infty$, let $L^p = L^p(\Omega, \mathscr{F}, \mu)$ be the
class of (measurable) real functions $f$ for which $|f|^p$ is integrable, and for $f$
in $L^p$, write

(19.1)  $$\|f\|_p = \left[ \int |f|^p\,d\mu \right]^{1/p}.$$

There is a special definition for the case $p = \infty$: The essential supremum of $f$
is

(19.2)  $$\|f\|_\infty = \inf\left[ \alpha\colon \mu[\omega\colon |f(\omega)| > \alpha] = 0 \right];$$

take $L^\infty$ to consist of those $f$ for which this is finite. The spaces $L^p$ have a
geometric structure that can usefully guide the intuition. The basic facts are
laid out in this section, together with two applications to theoretical statistics.


*The results in this section are used nowhere else in the book. The proofs require some
elementary facts about metric spaces, vector spaces, and convex sets; and in one place the
Radon-Nikodym theorem of Section 32 is used. As a matter of fact, it is possible to jump ahead
and read Section 32 at this point, since it makes no use of Chapters 4 and 5.




For $1 < p, q < \infty$, $p$ and $q$ are conjugate indices if $p^{-1} + q^{-1} = 1$; $p$ and $q$
are also conjugate if one is 1 and the other is $\infty$ (formally, $1^{-1} + \infty^{-1} = 1$).
Hölder's inequality says that if $p$ and $q$ are conjugate and if $f \in L^p$ and
$g \in L^q$, then $fg$ is integrable and

(19.3)  $$\left| \int fg\,d\mu \right| \le \int |fg|\,d\mu \le \|f\|_p \|g\|_q.$$

This is obvious if $p = 1$ and $q = \infty$. If $1 < p, q < \infty$ and $\mu$ is a probability
measure, and if $f$ and $g$ are simple functions, (19.3) is (5.35). But the proof
in Section 5 goes over without change to the general case.

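Minkowski's inequality says that if $f, g \in L^p$ ($1 \le p \le \infty$), then $f + g \in L^p$
and

(19.4)  $$\|f + g\|_p \le \|f\|_p + \|g\|_p.$$

This is clear if $p = 1$ or $p = \infty$. Suppose that $1 < p < \infty$. Since
$|f + g| \le 2(|f|^p + |g|^p)^{1/p}$, $f + g$ does lie in $L^p$. Let $q$ be conjugate to $p$.
Since $p - 1 = p/q$, Hölder's inequality gives

$$\|f + g\|_p^p = \int |f + g|^p\,d\mu$$
$$\le \int |f| \cdot |f + g|^{p/q}\,d\mu + \int |g| \cdot |f + g|^{p/q}\,d\mu$$
$$\le \|f\|_p \cdot \big\| |f + g|^{p/q} \big\|_q + \|g\|_p \cdot \big\| |f + g|^{p/q} \big\|_q$$
$$= (\|f\|_p + \|g\|_p) \left[ \int |f + g|^p\,d\mu \right]^{1/q}$$
$$= (\|f\|_p + \|g\|_p)\,\|f + g\|_p^{p/q}.$$

Since $p - p/q = 1$, (19.4) follows.†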
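On a finite measure space the norms in (19.3) and (19.4) are weighted sums, so both inequalities can be spot-checked numerically. The sketch below assumes a four-point space; the weights `w`, the functions `f` and `g`, and the exponent `p` are arbitrary illustrative values, and the check is a floating-point illustration rather than a proof.

```python
def lp_norm(f, w, p):
    """||f||_p on a finite measure space whose points carry masses w."""
    return sum(wi * abs(fi) ** p for fi, wi in zip(f, w)) ** (1.0 / p)


w = [0.5, 2.0, 1.0, 0.25]     # hypothetical point masses mu{omega_i}
f = [1.0, -2.0, 0.5, 3.0]
g = [-1.5, 0.25, 2.0, 1.0]
p = 3.0
q = p / (p - 1.0)             # conjugate index: 1/p + 1/q = 1

# Hoelder (19.3): integral of |fg| is at most ||f||_p * ||g||_q.
integral_abs_fg = sum(wi * abs(fi * gi) for fi, gi, wi in zip(f, g, w))
holder_bound = lp_norm(f, w, p) * lp_norm(g, w, q)

# Minkowski (19.4): ||f + g||_p is at most ||f||_p + ||g||_p.
sum_norm = lp_norm([fi + gi for fi, gi in zip(f, g)], w, p)
minkowski_bound = lp_norm(f, w, p) + lp_norm(g, w, p)

print(integral_abs_fg, holder_bound)
print(sum_norm, minkowski_bound)
```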
If $\alpha$ is real and $f \in L^p$, then obviously $\alpha f \in L^p$ and

(19.5)  $$\|\alpha f\|_p = |\alpha| \cdot \|f\|_p.$$

Define a metric on $L^p$ by $d_p(f, g) = \|f - g\|_p$. Minkowski's inequality
gives the triangle inequality for $d_p$, and $d_p$ is certainly symmetric. Further,
$d_p(f, g) = 0$ if and only if $|f - g|^p$ integrates to 0, that is, $f = g$ almost
everywhere. To make $L^p$ a metric space, identify functions that are equal
almost everywhere.


†The Hölder and Minkowski inequalities can also be proved by convexity arguments; see
Problems 5.9 and 5.10.




If $\|f - f_n\|_p \to 0$ and $p < \infty$, so that $\int |f - f_n|^p\,d\mu \to 0$, then $f_n$ is said to
converge to $f$ in the mean of order $p$.

If $f = f'$ and $g = g'$ almost everywhere, then $f + g = f' + g'$ almost everywhere,
and for real $\alpha$, $\alpha f = \alpha f'$ almost everywhere. In $L^p$, $f$ and $f'$ become
the same function, and similarly for the pairs $g$ and $g'$, $f + g$ and $f' + g'$, and
$\alpha f$ and $\alpha f'$. This means that addition and scalar multiplication are unambiguously
defined in $L^p$, which is thus a real vector space. It is a normed
vector space in the sense that it is equipped with a norm $\|\cdot\|_p$ satisfying (19.4)
and (19.5).


Completeness and Separability 


A normed vector space is a Banach space if it is complete under the
corresponding metric. According to the Riesz-Fischer theorem, this is true
of $L^p$:


Theorem 19.1. The space $L^p$ is complete.


Proof. It is enough to show that every fundamental sequence contains a
convergent subsequence. Suppose first that $p < \infty$. Assume that $\|f_m - f_n\|_p \to 0$
as $m, n \to \infty$, and choose an increasing sequence $\{n_k\}$ so that $\|f_m - f_n\|_p^p <
2^{-(p+1)k}$ for $m, n \ge n_k$. Since $\int |f_m - f_n|^p\,d\mu \ge \alpha^p \mu[|f_m - f_n| \ge \alpha]$ (this is
just a general version of Markov's inequality (5.31)), $\mu[|f_m - f_n| \ge 2^{-k}] \le
2^{pk}\|f_m - f_n\|_p^p < 2^{-k}$ for $m, n \ge n_k$. Therefore, $\sum_k \mu[|f_{n_{k+1}} - f_{n_k}| \ge 2^{-k}] < \infty$,
and it follows by the first Borel-Cantelli lemma (which works for arbitrary
measures) that, outside a set of $\mu$-measure 0, $\sum_k |f_{n_{k+1}} - f_{n_k}|$ converges.
But then $f_{n_k}$ converges to some $f$ almost everywhere, and by Fatou's
lemma, $\int |f - f_{n_k}|^p\,d\mu \le \liminf_i \int |f_{n_i} - f_{n_k}|^p\,d\mu \le 2^{-k}$. Therefore, $f \in L^p$ and
$\|f - f_{n_k}\|_p \to 0$, as required.

If $p = \infty$, choose $\{n_k\}$ so that $\|f_m - f_n\|_\infty < 2^{-k}$ for $m, n \ge n_k$. Since
$|f_{n_{k+1}} - f_{n_k}| < 2^{-k}$ almost everywhere, $f_{n_k}$ converges to some $f$, and $|f - f_{n_k}| \le
2^{-k}$ almost everywhere. Again, $\|f - f_{n_k}\|_\infty \to 0$. ∎


The next theorem has to do with separability. 


Theorem 19.2. (i) Let $U$ be the set of simple functions $\sum_{i=1}^m \alpha_i I_{B_i}$ for $\alpha_i$
and $\mu(B_i)$ finite. For $1 \le p \le \infty$, $U$ is dense in $L^p$.

(ii) If $\mu$ is $\sigma$-finite and $\mathscr{F}$ is countably generated, and if $p < \infty$, then $L^p$ is
separable.


Proof. Proof of (i). Suppose first that $p < \infty$. For $f \in L^p$, choose (Theorem
13.5) simple functions $f_n$ such that $f_n \to f$ and $|f_n| \le |f|$. Then $f_n \in L^p$,
and by the dominated convergence theorem, $\int |f - f_n|^p\,d\mu \to 0$. Therefore,
$\|f - f_n\|_p < \epsilon$ for some $n$; but each $f_n$ is in $U$. As for the case $p = \infty$, if
$n2^n > \|f\|_\infty$, then the $f_n$ defined by (13.6) satisfies $\|f - f_n\|_\infty \le 2^{-n}$ ($< \epsilon$ for
large $n$).

Proof of (ii). Suppose that $\mathscr{F}$ is generated by a countable class $\mathscr{C}$ and
that $\Omega$ is covered by a countable class $\mathscr{D}$ of $\mathscr{F}$-sets of finite measure. Let
$E_1, E_2, \ldots$ be an enumeration of $\mathscr{C} \cup \mathscr{D}$; let $\mathscr{P}_n$ ($n \ge 1$) be the partition
consisting of the sets of the form $F_1 \cap \cdots \cap F_n$, where $F_i$ is $E_i$ or $E_i^c$; and let
$\mathscr{F}_n$ be the field of unions of the sets in $\mathscr{P}_n$. Then $\mathscr{F}_0 = \bigcup_{n=1}^\infty \mathscr{F}_n$ is a
countable field that generates $\mathscr{F}$, and $\mu$ is $\sigma$-finite on $\mathscr{F}_0$. Let $V$ be the set
of simple functions $\sum_{i=1}^m \alpha_i I_{A_i}$ for $\alpha_i$ rational, $A_i \in \mathscr{F}_0$, and $\mu(A_i) < \infty$.

Let $g = \sum_{i=1}^m \alpha_i I_{B_i}$ be the element of $U$ constructed in the proof of part (i).
Then $\|f - g\|_p < \epsilon$, the $\alpha_i$ are rational by (13.6), and any $\alpha_i$ that vanish can
be suppressed. By Theorem 11.4(ii), there exist sets $A_i$ in $\mathscr{F}_0$ such that
$\mu(B_i \,\triangle\, A_i) < (\epsilon/m|\alpha_i|)^p$, provided $p < \infty$, and then $h = \sum_{i=1}^m \alpha_i I_{A_i}$ lies in $V$
and $\|f - h\|_p < 2\epsilon$. But $V$ is countable.† ∎


Conjugate Spaces 


A linear functional on $L^p$ is a real-valued function $\gamma$ such that

(19.6)  $$\gamma(\alpha f + \alpha' f') = \alpha\gamma(f) + \alpha'\gamma(f').$$

The functional is bounded if there is a finite $M$ such that

(19.7)  $$|\gamma(f)| \le M \cdot \|f\|_p$$

for all $f$ in $L^p$. A bounded linear functional is uniformly continuous on $L^p$
because $\|f - f'\|_p < \epsilon/M$ implies $|\gamma(f) - \gamma(f')| < \epsilon$ (if $M > 0$; and $M = 0$
implies $\gamma(f) \equiv 0$). The norm $\|\gamma\|$ of $\gamma$ is the smallest $M$ that works in (19.7):
$\|\gamma\| = \sup[|\gamma(f)|/\|f\|_p\colon f \ne 0]$.


Suppose $p$ and $q$ are conjugate indices and $g \in L^q$. By Hölder's inequality,

(19.8)  $$\gamma_g(f) = \int fg\,d\mu$$

is defined for $f \in L^p$ and satisfies (19.7) if $M = \|g\|_q$; and $\gamma_g$ is obviously
linear. According to the Riesz representation theorem, this is the most general
bounded linear functional in the case $p < \infty$:


Theorem 19.3. Suppose that $\mu$ is $\sigma$-finite, that $1 \le p < \infty$, and that $q$ is
conjugate to $p$. Every bounded linear functional on $L^p$ has the form (19.8) for
some $g \in L^q$; further,

(19.9)  $$\|\gamma_g\| = \|g\|_q,$$

and $g$ is unique up to a set of $\mu$-measure 0.


†Part (ii) definitely requires $p < \infty$; see Problem 19.2.




The space of bounded linear functionals on $L^p$ is called the dual space, or
the conjugate space, and the theorem identifies $L^q$ as the dual of $L^p$. Note
that the theorem does not cover the case $p = \infty$.†


Proof. Case I: $\mu$ finite. For $A$ in $\mathscr{F}$, define $\varphi(A) = \gamma(I_A)$. The linearity
of $\gamma$ implies that $\varphi$ is finitely additive. For the $M$ of (19.7), $|\varphi(A)| \le M \cdot \|I_A\|_p
= M \cdot |\mu(A)|^{1/p}$. If $A = \bigcup_n A_n$, where the $A_n$ are disjoint, then $\varphi(A) =
\sum_{n=1}^N \varphi(A_n) + \varphi(\bigcup_{n > N} A_n)$, and since $|\varphi(\bigcup_{n > N} A_n)| \le M\mu^{1/p}(\bigcup_{n > N} A_n) \to
0$, it follows that $\varphi$ is an additive set function in the sense of (32.1).

The Jordan decomposition (32.2) represents $\varphi$ as the difference of two
finite measures $\varphi^+$ and $\varphi^-$ with disjoint supports $A^+$ and $A^-$. If $\mu(A) = 0$,
then $\varphi^+(A) = \varphi(A \cap A^+) \le M\mu^{1/p}(A) = 0$. Thus $\varphi^+$ is absolutely continuous
with respect to $\mu$ and by the Radon-Nikodym theorem (p. 422) has an
integrable density $g^+$: $\varphi^+(A) = \int_A g^+\,d\mu$. Together with the same result for
$\varphi^-$, this shows that there is an integrable $g$ such that $\gamma(I_A) = \varphi(A) =
\int_A g\,d\mu = \int I_A g\,d\mu$. Thus $\gamma(f) = \int fg\,d\mu$ for simple functions $f$ in $L^p$.

Assume that this $g$ lies in $L^q$, and define $\gamma_g$ by the equation (19.8). Then
$\gamma$ and $\gamma_g$ are bounded linear functionals that agree for simple functions;
since the latter are dense (Theorem 19.2(i)), it follows by the continuity of $\gamma$
and $\gamma_g$ that they agree on all of $L^p$. It is therefore enough (in the case of
finite $\mu$) to prove $g \in L^q$. It will also be shown that $\|g\|_q$ is at most the $M$ of
(19.7); since $\|g\|_q$ does work as a bound in (19.7), (19.9) will follow. If
$\gamma_g(f) \equiv 0$, (19.9) will imply that $g = 0$ almost everywhere, and for the general
$\gamma$ it will follow further that two functions $g$ satisfying $\gamma_g(f) \equiv \gamma(f)$ must
agree almost everywhere.

Assume that $1 < p, q < \infty$. Let $g_n$ be simple functions such that $0 \le g_n \uparrow |g|^q$,
and take $h_n = g_n^{1/p} \operatorname{sgn} g$. Then $h_n g = g_n^{1/p}|g| \ge g_n^{1/p} g_n^{1/q} = g_n$, and since $h_n$
is simple, it follows that $\int g_n\,d\mu \le \int h_n g\,d\mu = \gamma_g(h_n) = \gamma(h_n) \le M \cdot \|h_n\|_p =
M[\int g_n\,d\mu]^{1/p}$. Since $1 - 1/p = 1/q$, this gives $[\int g_n\,d\mu]^{1/q} \le M$. Now the
monotone convergence theorem gives $g \in L^q$ and even $\|g\|_q \le M$.

Assume that $p = 1$ and $q = \infty$. In this case, $|\int fg\,d\mu| = |\gamma_g(f)| = |\gamma(f)| \le M \cdot
\|f\|_1$ for simple functions $f$ in $L^1$. Take $f = \operatorname{sgn} g \cdot I_{[|g| \ge \alpha]}$. Then $\alpha\mu[|g| \ge \alpha] \le
\int_{[|g| \ge \alpha]} |g|\,d\mu = \int fg\,d\mu \le M \cdot \|f\|_1 = M\mu[|g| \ge \alpha]$. If $\alpha > M$, this inequality
gives $\mu[|g| \ge \alpha] = 0$; therefore $\|g\|_\infty = \|g\|_q \le M$ and $g \in L^\infty = L^q$.

Case II: $\mu$ $\sigma$-finite. Let $A_n$ be sets such that $A_n \uparrow \Omega$ and $\mu(A_n) < \infty$. If
$\mu_n(A) = \mu(A \cap A_n)$, then $|\gamma(fI_{A_n})| \le M \cdot \|fI_{A_n}\|_p = M \cdot [\int |f|^p\,d\mu_n]^{1/p}$ for
$f \in L^p(\mu_n) = L^p(\Omega, \mathscr{F}, \mu_n)$. By the finite case, $A_n$ supports a $g_n$ in $L^q$ such
that $\gamma(fI_{A_n}) = \int fI_{A_n} g_n\,d\mu$ for $f \in L^p$, and $\|g_n\|_q \le M$. Because of uniqueness,
$g_{n+1}$ can be taken to agree with $g_n$ on $A_n$ ($L^p(\mu_{n+1}) \subset L^p(\mu_n)$). There is
therefore a function $g$ on $\Omega$ such that $g = g_n$ on $A_n$ and $\|I_{A_n} g\|_q \le M$. It
follows that $\|g\|_q \le M$ and $g \in L^q$. By the dominated convergence theorem
and the continuity of $\gamma$, $f \in L^p$ implies $\int fg\,d\mu = \lim_n \int fI_{A_n} g\,d\mu =
\lim_n \gamma(fI_{A_n}) = \gamma(f)$. Uniqueness follows as before. ∎


†Problem 19.3.




Weak Compactness 


For $f \in L^p$ and $g \in L^q$, where $p$ and $q$ are conjugate, write

(19.10)  $$(f, g) = \int fg\,d\mu.$$

For fixed $f$ in $L^p$, this defines a bounded linear functional on $L^q$; for fixed $g$
in $L^q$, it defines a bounded linear functional on $L^p$. By Hölder's inequality,

(19.11)  $$|(f, g)| \le \|f\|_p \|g\|_q.$$

Suppose that $f$ and $f_n$ are elements of $L^p$. If $(f, g) = \lim_n (f_n, g)$ for each
$g$ in $L^q$, then $f_n$ converges weakly to $f$. If $\|f - f_n\|_p \to 0$, then certainly $f_n$
converges weakly to $f$, although the converse is false.†

The next theorem says in effect that if $p > 1$, then the unit ball $B^p =
[f \in L^p\colon \|f\|_p \le 1]$ is compact in the topology of weak convergence.


Theorem 19.4. Suppose that $\mu$ is $\sigma$-finite and $\mathscr{F}$ is countably generated. If
$1 < p \le \infty$, then every sequence in $B^p$ contains a subsequence converging weakly
to an element of $B^p$.

Suppose of elements $f_n$, $f$, and $f'$ of $L^p$ that $f_n$ converges weakly both to $f$
and to $f'$. Since, by hypothesis, $\mu$ is $\sigma$-finite and $p > 1$, Theorem 19.3 applies if
the $p$ and $q$ there are interchanged. And now, since $(f, g) = (f', g)$ for all $g$ in
$L^q$, it follows by uniqueness that $f = f'$. Therefore, weak limits are unique under
the present hypothesis. The assumption $p > 1$ is essential.‡


Proof. Let $q$ be conjugate to $p$ ($1 \le q < \infty$). By Theorem 19.2(ii), $L^q$
contains a countable, dense set $G$. Add to $G$ all finite, rational linear
combinations of its elements; it is still countable. Suppose that $\{f_n\} \subset B^p$.

By (19.11), $|(f_n, g)| \le \|g\|_q$ for $g \in L^q$. Since $\{(f_n, g)\}$ is bounded, it is
possible by the diagonal method [A14] to pass to a subsequence of $\{f_n\}$ along
which, for each of the countably many $g$ in $G$, the limit $\lim_n (f_n, g) = \gamma(g)$
exists and $|\gamma(g)| \le \|g\|_q$. For $g, g' \in G$, $|\gamma(g) - \gamma(g')| = \lim_n |(f_n, g - g')| \le
\|g - g'\|_q$. Therefore, $\gamma$ is uniformly continuous on $G$ and so has a unique
continuous extension to all of $L^q$. For $g, g' \in G$, $\gamma(g + g') = \lim_n (f_n, g + g')
= \gamma(g) + \gamma(g')$; by continuity, this extends to all of $L^q$. For $g \in G$ and $\alpha$
rational, $\gamma(\alpha g) = \alpha \lim_n (f_n, g) = \alpha\gamma(g)$; by continuity, this extends to all real
$\alpha$ and all $g$ in $L^q$: $\gamma$ is a linear functional on $L^q$. Finally, $|\gamma(g)| \le \|g\|_q$
extends from $G$ to $L^q$ by continuity, and $\gamma$ is bounded in the sense of (19.7).

By the Riesz representation theorem ($1 \le q < \infty$), there is an $f$ in $L^p$ (the
space adjoint to $L^q$) such that $\gamma(g) = (f, g)$ for all $g$. Since $\gamma$ has norm at
most 1, (19.9) implies that $\|f\|_p \le 1$: $f$ lies in $B^p$.

Now $(f, g) = \lim_n (f_n, g)$ for $g$ in $G$. Suppose that $g'$ is an arbitrary
element of $L^q$, and choose $g$ in $G$ so that $\|g' - g\|_q < \epsilon$. Then

$$|(f, g') - (f_n, g')| \le |(f, g') - (f, g)| + |(f, g) - (f_n, g)| + |(f_n, g) - (f_n, g')|$$
$$\le \|f\|_p \|g' - g\|_q + |(f, g) - (f_n, g)| + \|f_n\|_p \|g - g'\|_q$$
$$\le 2\epsilon + |(f, g) - (f_n, g)|.$$

Since $g \in G$, the last term here goes to 0, and hence $\lim_n (f_n, g') = (f, g')$ for
all $g'$ in $L^q$. Therefore, $f_n$ converges weakly to $f$. ∎


†Problem 19.4.
‡Problem 19.5.


Some Decision Theory 


The weak compactness of the unit ball in L” has interesting implications for statistical 
decision theory. Suppose that u is o-finite and Z is countably generated. Let 
fiia f,, be probability densities with respect to 4—nonnegative and integrating to 
1. Imagine that, for some i, w is drawn from Q according to the probability measure 
P(A)= [,f, du. The statistical problem is to decide, on the basis of an observed w, 
which f; is the right one. 

Assume that if the right density is f;, then a statistician choosing f; incurs a 
nonnegative loss L(j|i). A decision rule is a vector function 5(w) = (5,(@),..., 5,(@)), 
where the ô;(w) are nonnegative and add to 1: the statistician, observing w, selects f; 
with probability 5,(w). If, for each w, 5,(w) is 1 for one i and 0 for the others, ô is a 
nonrandomized rule; otherwise, it is a randomized rule. Let D be the set of all rules. 
The problem is to choose, in some more or less rational way that connects up with the 
losses L(j|i), a rule 6 from D. 

The risk corresponding to 6 and f; is 


R,() = | [Eor youto), 


which can be interpreted as the loss a statistician using ô can expect if f; is the right 
density. The risk point for 6 is R(8)=(R (8), ..., R,(5)). If R;(8') < R;(ô) for all i 
and R;(8') < R;(8) for some i—that is, if the point R(ô') is “southwest” of R(8)—then 
of course 8’ is taken as being better than 5. There is in general no rule better than all 
the others. (Different rules can have the same risk point, but they are then indistin- 
guishable as regards the decision problem.) , 

The risk set is the collection $S$ of all the risk points; it is a bounded set in the first orthant of $R^k$. To avoid trivialities, assume that $S$ does not contain the origin (as would happen for example if the $L(j|i)$ were all 0).

Suppose that $\delta$ and $\delta'$ are elements of $D$, and $\lambda$ and $\lambda'$ are nonnegative and add to 1. If $\delta''_i(\omega) = \lambda\delta_i(\omega) + \lambda'\delta'_i(\omega)$, then $\delta''$ is in $D$ and $R(\delta'') = \lambda R(\delta) + \lambda' R(\delta')$. Therefore, $S$ is a convex set.
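The convexity relation can be checked numerically in a toy setting. The sketch below assumes a finite sample space with $\mu$ given by point weights, so that the risk integral reduces to a sum; the names `risk_point` and `mix`, the two densities, and the 0-1 loss are illustrative assumptions, not part of the text.

```python
def risk_point(delta, densities, loss, weights):
    """Risk point (R_1, ..., R_k) of a rule on a finite sample space.

    delta[w][j]     : probability of selecting density j after observing w
    densities[i][w] : value f_i(w)
    loss[j][i]      : loss L(j|i) for choosing j when i is correct
    weights[w]      : mu({w}); the integral becomes a weighted sum
    """
    k = len(densities)
    points = range(len(weights))
    return [
        sum(
            sum(delta[w][j] * loss[j][i] for j in range(k))
            * densities[i][w] * weights[w]
            for w in points
        )
        for i in range(k)
    ]


def mix(delta_a, delta_b, lam):
    """Pointwise convex combination lam * delta_a + (1 - lam) * delta_b."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(delta_a, delta_b)
    ]


# Hypothetical example: three sample points, counting measure, 0-1 loss.
weights = [1.0, 1.0, 1.0]
densities = [[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
loss = [[0.0, 1.0], [1.0, 0.0]]
delta_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # always choose the first density
delta_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]   # choose the first density only at point 0

lam = 0.25
r_a = risk_point(delta_a, densities, loss, weights)
r_b = risk_point(delta_b, densities, loss, weights)
r_mix = risk_point(mix(delta_a, delta_b, lam), densities, loss, weights)
expected = [lam * x + (1.0 - lam) * y for x, y in zip(r_a, r_b)]
```

Here `r_mix` and `expected` agree up to rounding, which is the relation $R(\delta'') = \lambda R(\delta) + \lambda' R(\delta')$ in this finite case.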

Lying much deeper is the fact that $S$ is compact. Given points $x^{(n)}$ in $S$, choose rules $\delta^{(n)}$ such that $R(\delta^{(n)}) = x^{(n)}$. Now $\delta_j^{(n)}(\cdot)$ is an element of $L^\infty$, in fact of the unit ball $B_1^\infty$, and so by Theorem 19.4 there is a subsequence along which, for each $j = 1, \ldots, k$, $\delta_j^{(n)}$ converges weakly to a function $\delta_j$ in $B_1^\infty$. If $\mu(A) < \infty$, then $\int \delta_j I_A\,d\mu = \lim_n \int \delta_j^{(n)} I_A\,d\mu \ge 0$ and $\int (1 - \sum_j \delta_j) I_A\,d\mu = \lim_n \int (1 - \sum_j \delta_j^{(n)}) I_A\,d\mu = 0$, so that $\delta_j \ge 0$ and $\sum_j \delta_j = 1$ almost everywhere on $A$. Since $\mu$ is $\sigma$-finite, the $\delta_j$ can be altered on a set of $\mu$-measure 0 in such a way as to ensure that $\delta = (\delta_1, \ldots, \delta_k)$ is an element of $D$. But, along the subsequence, $x^{(n)} \to R(\delta)$. Therefore: The risk set is compact and convex.

The rest is geometry. For $x$ in $R^k$, let $Q_x$ be the set of $x'$ such that $0 \le x'_i \le x_i$ for all $i$. If $x = R(\delta)$ and $x' = R(\delta')$, then $\delta'$ is better than $\delta$ if and only if $x' \in Q_x$ and $x' \ne x$. A rule $\delta$ is admissible if there exists no $\delta'$ better than $\delta$; it makes no sense to use a rule that is not admissible. Geometrically, admissibility means that, for $x = R(\delta)$, $S \cap Q_x$ consists of $x$ alone.




Let $x = R(\delta)$ be given, and suppose that $\delta$ is not admissible. Since $S \cap Q_x$ is compact, it contains a point $x'$ nearest the origin ($x'$ unique, since $S \cap Q_x$ is convex as well as compact); let $\delta'$ be a corresponding rule: $x' = R(\delta')$. Since $\delta$ is not admissible, $x' \ne x$, and $\delta'$ is better than $\delta$. If $S \cap Q_{x'}$ contained a point distinct from $x'$, it would be a point of $S \cap Q_x$ nearer the origin than $x'$, which is impossible. This means that $Q_{x'}$ contains no point of $S$ other than $x'$ itself, which means in turn that $\delta'$ is admissible. Therefore, if $\delta$ is not itself admissible, there is a $\delta'$ that is admissible and is better than $\delta$. This is expressed by saying that the class of admissible rules is complete.

Let $p = (p_1, \ldots, p_k)$ be a probability vector, and view $p_i$ as an a priori probability that $f_i$ is the correct density. A rule $\delta$ has Bayes risk $R(p, \delta) = \sum_i p_i R_i(\delta)$ with respect to $p$. This is a kind of compound risk: $f_i$ is correct with probability $p_i$, and the statistician chooses $f_j$ with probability $\delta_j(\omega)$. A Bayes rule is one that minimizes the Bayes risk for a given $p$. In this case, take $a = R(p, \delta)$ and consider the hyperplane

(19.12)  $H = \bigl[z: \sum_i p_i z_i = a\bigr]$

and the half space

(19.13)  $H^+ = \bigl[z: \sum_i p_i z_i \ge a\bigr].$

Then $x = R(\delta)$ lies on $H$, and $S$ is contained in $H^+$: $x$ is on the boundary of $S$, and $H$ is a supporting hyperplane. If $p_i > 0$ for all $i$, then $Q_x$ meets $S$ only at $x$, and so $\delta$ is admissible.

Suppose now that $\delta$ is admissible, so that $x = R(\delta)$ is the only point in $S \cap Q_x$ and $x$ lies on the boundary of $S$. The problem is to show that $\delta$ is a Bayes rule, which means finding a supporting hyperplane (19.12) corresponding to a probability vector $p$. Let $T$ consist of those $y$ for which $Q_y$ meets $S$. Then $T$ is convex: given a convex combination $y'' = \lambda y + \lambda' y'$ of points in $T$, choose in $S$ points $z$ and $z'$ southwest of $y$ and $y'$, respectively, and note that $z'' = \lambda z + \lambda' z'$ lies in $S$ and is southwest of $y''$. Since $S$ meets $Q_x$ only in the point $x$, the same is true of $T$, so that $x$ is a boundary point of $T$ as well as of $S$. Let (19.12) ($p \ne 0$) be a supporting hyperplane through $x$: $x \in H$ and $T \subset H^+$. If $p_{i_0} < 0$ for some $i_0$, take $z_{i_0} = x_{i_0} + 1$ and take $z_i = x_i$ for the other $i$; then $z$ lies in $T$ but not in $H^+$, a contradiction. (The right-hand figure shows the role of $T$: the planes $H_1$ and $H_2$ both support $S$, but only one of them supports $T$, and only that one corresponds to a probability vector.) Thus $p_i \ge 0$ for all $i$, and since $\sum_i p_i = 1$ can be arranged by normalization, $\delta$ is indeed a Bayes rule. Therefore: The admissible rules are Bayes rules, and they form a complete class.
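For a finite sample space the Bayes risk is a sum over sample points, $R(p, \delta) = \sum_\omega \sum_j \delta_j(\omega) c_j(\omega)$ with $c_j(\omega) = \sum_i p_i L(j|i) f_i(\omega) \mu\{\omega\}$, so it is minimized by putting all the mass of $\delta(\omega)$ on an index $j$ with the smallest $c_j(\omega)$. The sketch below assumes this finite setting; the names `posterior_cost`, `bayes_rule`, and `bayes_risk` and all the numbers are hypothetical. It compares the pointwise rule with every nonrandomized rule by brute force.

```python
from itertools import product


def posterior_cost(j, w, p, densities, loss, weights):
    """c_j(w) = sum_i p_i L(j|i) f_i(w) mu({w})."""
    return sum(
        p[i] * loss[j][i] * densities[i][w] * weights[w]
        for i in range(len(p))
    )


def bayes_rule(p, densities, loss, weights):
    """Nonrandomized rule choosing, at each w, an index j minimizing c_j(w)."""
    k = len(p)
    return [
        min(range(k), key=lambda j: posterior_cost(j, w, p, densities, loss, weights))
        for w in range(len(weights))
    ]


def bayes_risk(choice, p, densities, loss, weights):
    """Bayes risk of the nonrandomized rule that selects choice[w] at w."""
    return sum(
        posterior_cost(choice[w], w, p, densities, loss, weights)
        for w in range(len(weights))
    )


# Hypothetical example: two densities on three points, 0-1 loss.
weights = [1.0, 1.0, 1.0]
densities = [[0.5, 0.3, 0.2], [0.1, 0.3, 0.6]]
loss = [[0.0, 1.0], [1.0, 0.0]]
p = [0.4, 0.6]

best = bayes_rule(p, densities, loss, weights)
best_risk = bayes_risk(best, p, densities, loss, weights)
all_risks = [
    bayes_risk(list(choice), p, densities, loss, weights)
    for choice in product(range(2), repeat=3)
]
```

Since the Bayes risk is linear in $\delta$, comparing against the nonrandomized rules is enough in this finite example; randomizing cannot do better than the best vertex.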


The Space $L^2$


The space $L^2$ is special because $p = 2$ is its own conjugate index. If $f, g \in L^2$, the inner product $(f, g) = \int fg\,d\mu$ is well defined, and by (19.11) (write $\|f\|$ in place of $\|f\|_2$) $|(f, g)| \le \|f\| \cdot \|g\|$. This is the Schwarz (or Cauchy-Schwarz) inequality. If one of $f$ and $g$ is fixed, $(f, g)$ is a bounded (hence continuous) linear functional in the other. Further, $(f, g) = (g, f)$, the norm is given by $\|f\|^2 = (f, f)$, and $L^2$ is complete under the metric $d(f, g) = \|f - g\|$. A Hilbert space is a vector space on which is defined an inner product having all these properties.

The Hilbert space $L^2$ is quite like Euclidean space. If $(f, g) = 0$, then $f$ and $g$ are orthogonal, and orthogonality is like perpendicularity. If $f_1, \ldots, f_n$ are orthogonal (in pairs), then by linearity, $\|\sum_i f_i\|^2 = (\sum_i f_i, \sum_j f_j) = \sum_i \sum_j (f_i, f_j) = \sum_i \|f_i\|^2$. This is a version of the Pythagorean theorem. If $f$ and $g$ are orthogonal, write $f \perp g$. For every $f$, $f \perp 0$.

Suppose now that $\mu$ is $\sigma$-finite and $\mathscr{F}$ is countably generated, so that $L^2$ is separable as a metric space. The construction that follows gives a sequence (finite or infinite) $\varphi_1, \varphi_2, \ldots$ that is orthonormal in the sense that $\|\varphi_n\| = 1$ for all $n$ and $(\varphi_m, \varphi_n) = 0$ for $m \ne n$, and is complete in the sense that $(f, \varphi_n) = 0$ for all $n$ implies $f = 0$, so that the orthonormal system cannot be enlarged. Start with a sequence $f_1, f_2, \ldots$ that is dense in $L^2$. Define $g_1, g_2, \ldots$ inductively: Let $g_1 = f_1$. Suppose that $g_1, \ldots, g_n$ have been defined and are orthogonal. Define $g_{n+1} = f_{n+1} - \sum_{i=1}^n \alpha_{ni} g_i$, where $\alpha_{ni}$ is $(f_{n+1}, g_i)/\|g_i\|^2$ if $g_i \ne 0$ and is arbitrary if $g_i = 0$. Then $g_{n+1}$ is orthogonal to $g_1, \ldots, g_n$, and $f_{n+1}$ is a linear combination of $g_1, \ldots, g_{n+1}$. This, the Gram-Schmidt method, gives an orthogonal sequence $g_1, g_2, \ldots$ with the property that the finite linear combinations of the $g_n$ include all the $f_n$ and are therefore dense in $L^2$. If $g_n \ne 0$, take $\varphi_n = g_n/\|g_n\|$; if $g_n = 0$, discard it from the sequence. Then $\varphi_1, \varphi_2, \ldots$ is orthonormal, and the finite linear combinations of the $\varphi_n$ are still dense. It can happen that all but finitely many of the $g_n$ are 0, in which case there are only finitely many of the $\varphi_n$. In what follows it is assumed that $\varphi_1, \varphi_2, \ldots$ is an infinite sequence; the finite case is analogous and somewhat simpler.
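The Gram-Schmidt construction just described can be sketched for finite-dimensional vectors, with the Euclidean dot product standing in for the inner product $(f, g)$. This is an illustration under that assumption; the names `dot` and `gram_schmidt` and the tolerance `tol` are hypothetical choices, and the tolerance replaces the exact test $g_n = 0$ because of floating-point rounding.

```python
import math


def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal list spanning the same space as `vectors`.

    Follows the construction in the text: g_{n+1} = f_{n+1} - sum a_i g_i with
    a_i = (f_{n+1}, g_i) / ||g_i||^2, skipping any g_i that is (numerically)
    zero, then normalizing the nonzero g_n and discarding the zero ones.
    """
    orthogonal = []      # the g_n, including (near-)zero ones
    orthonormal = []     # the phi_n
    for f in vectors:
        g = list(f)
        for h in orthogonal:
            norm_sq = dot(h, h)
            if norm_sq > tol:
                coeff = dot(f, h) / norm_sq
                g = [gi - coeff * hi for gi, hi in zip(g, h)]
        orthogonal.append(g)
        norm_sq = dot(g, g)
        if norm_sq > tol:
            norm = math.sqrt(norm_sq)
            orthonormal.append([gi / norm for gi in g])
    return orthonormal


# The second vector is a multiple of the first, so its g is 0 and is discarded.
basis = gram_schmidt([[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [1.0, 0.0, 1.0]])
```

Three input vectors yield two orthonormal vectors here, matching the remark that zero $g_n$ are dropped from the sequence.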

Suppose that $f$ is orthogonal to all the $\varphi_n$. If $a_i$ are arbitrary scalars, then $f, a_1\varphi_1, \ldots, a_n\varphi_n$ is an orthogonal set, and by the Pythagorean property, $\|f - \sum_{i=1}^n a_i\varphi_i\|^2 = \|f\|^2 + \sum_{i=1}^n a_i^2 \ge \|f\|^2$. If $\|f\| > 0$, then $f$ cannot be approximated by finite linear combinations of the $\varphi_n$, a contradiction: $\varphi_1, \varphi_2, \ldots$ is a complete orthonormal system.

Consider now a sequence $a_1, a_2, \ldots$ of scalars for which $\sum_{i=1}^\infty a_i^2$ converges. If $s_n = \sum_{i=1}^n a_i\varphi_i$, then the Pythagorean theorem gives $\|s_n - s_m\|^2 = \sum_{m < i \le n} a_i^2$. Since the scalar series converges, $\{s_n\}$ is fundamental and therefore by Theorem 19.1 converges to some $g$ in $L^2$. Thus $g = \lim_n \sum_{i=1}^n a_i\varphi_i$, which it is natural to express as $g = \sum_{i=1}^\infty a_i\varphi_i$. The series (that is to say, the sequence of partial sums) converges to $g$ in the mean of order 2 (not almost everywhere). By the following argument, every element of $L^2$ has a unique representation in this form.

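The Fourier coefficients of $f$ with respect to $\{\varphi_i\}$ are the inner products $a_i = (f, \varphi_i)$. For each $n$, $0 \le \|f - \sum_{i=1}^n a_i\varphi_i\|^2 = \|f\|^2 - 2\sum_i a_i(f, \varphi_i) + \sum_{i,j} a_i a_j(\varphi_i, \varphi_j) = \|f\|^2 - \sum_{i=1}^n a_i^2$, and hence, $n$ being arbitrary, $\sum_{i=1}^\infty a_i^2 \le \|f\|^2$. By the argument above, the series $\sum_{i=1}^\infty a_i\varphi_i$ therefore converges. By linearity, $(f - \sum_{i=1}^n a_i\varphi_i, \varphi_j) = 0$ for $n \ge j$, and by continuity, $(f - \sum_{i=1}^\infty a_i\varphi_i, \varphi_j) = 0$. Therefore, $f - \sum_{i=1}^\infty a_i\varphi_i$ is orthogonal to each $\varphi_j$ and by completeness must be 0:

(19.14)  $f = \sum_{i=1}^\infty (f, \varphi_i)\varphi_i.$

This is the Fourier representation of $f$. It is unique because if $f = \sum_{i=1}^\infty a_i\varphi_i$ is 0 ($\sum a_i^2 < \infty$), then $a_i = (f, \varphi_i) = 0$. Because of (19.14), $\{\varphi_n\}$ is also called an orthonormal basis for $L^2$.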

A subset $M$ of $L^2$ is a subspace if it is closed both algebraically ($f, f' \in M$ implies $\alpha f + \alpha' f' \in M$) and topologically ($f_n \in M$, $f_n \to f$ implies $f \in M$). If $L^2$ is separable, then so is the subspace $M$, and the construction above carries over: $M$ contains an orthonormal system $\{\varphi_n\}$ that is complete in $M$, in the sense that $f = 0$ if $(f, \varphi_n) = 0$ for all $n$ and if $f \in M$. And each $f$ in $M$ has the unique Fourier representation (19.14). Even if $f$ does not lie in $M$, $\sum_{i=1}^\infty (f, \varphi_i)^2$ converges, so that $\sum_{i=1}^\infty (f, \varphi_i)\varphi_i$ is well defined.

This leads to a powerful idea, that of orthogonal projection onto $M$. For an orthonormal basis $\{\varphi_i\}$ of $M$, define $P_M f = \sum_{i=1}^\infty (f, \varphi_i)\varphi_i$ for all $f$ in $L^2$ (not just for $f$ in $M$). Clearly, $P_M f \in M$. Further, $f - \sum_{i=1}^n (f, \varphi_i)\varphi_i \perp \varphi_j$ for $n \ge j$ by linearity, so that $f - P_M f \perp \varphi_j$ by continuity. But if $f - P_M f$ is orthogonal to each $\varphi_j$, then, again by linearity and continuity, it is orthogonal to the general element $\sum_{j=1}^\infty b_j\varphi_j$ of $M$. Therefore, $P_M f \in M$ and $f - P_M f \perp M$. The map $f \to P_M f$ is the orthogonal projection on $M$.


The fundamental properties of $P_M$ are these:


(i) $g \in M$ and $f - g \perp M$ together imply $g = P_M f$;
(ii) $f \in M$ implies $P_M f = f$;
(iii) $g \in M$ implies $\|f - g\| \ge \|f - P_M f\|$;
(iv) $P_M(\alpha f + \alpha' f') = \alpha P_M f + \alpha' P_M f'$.


Property (i) says that $P_M f$ is uniquely determined by the two conditions $P_M f \in M$ and $f - P_M f \perp M$. To prove it, suppose that $g, g' \in M$, $f - g \perp M$, and $f - g' \perp M$. Then $g - g' \in M$ and $g - g' \perp M$, so that $g - g'$ is orthogonal to itself and hence $\|g - g'\|^2 = 0$: $g = g'$. Thus the mapping $P_M$ is independent of the particular basis $\{\varphi_i\}$; it is determined by $M$ alone.

Clearly, (ii) follows from (i); it implies that $P_M$ is idempotent in the sense that $P_M^2 f = P_M f$. As for (iii), if $g$ lies in $M$, so does $P_M f - g$, so that, by the Pythagorean relation, $\|f - g\|^2 = \|f - P_M f\|^2 + \|P_M f - g\|^2 \ge \|f - P_M f\|^2$; the inequality is strict if $g \ne P_M f$. Thus $P_M f$ is the unique point of $M$ lying nearest to $f$. Property (iv), linearity, follows from (i).
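The properties of $P_M$ can be illustrated in $R^3$, taking the dot product as the inner product and $M$ as a plane with a known orthonormal basis. This is only a finite-dimensional sketch; the names `project`, `basis`, `f`, and `g` are hypothetical.

```python
def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def project(f, basis):
    """P_M f = sum_i (f, phi_i) phi_i for an orthonormal basis of M."""
    coeffs = [dot(f, phi) for phi in basis]
    return [
        sum(c * phi[t] for c, phi in zip(coeffs, basis))
        for t in range(len(f))
    ]


# Hypothetical example: M is the plane spanned by two orthonormal vectors in R^3.
s = 2 ** -0.5
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]
f = [3.0, 1.0, 2.0]

pf = project(f, basis)                          # P_M f
residual = [a - b for a, b in zip(f, pf)]       # f - P_M f
ppf = project(pf, basis)                        # P_M (P_M f), for idempotence
g = [1.0, 1.0, 5.0]                             # some other element of M
diff_g = [a - b for a, b in zip(f, g)]
dist_to_pf = dot(residual, residual)            # ||f - P_M f||^2
dist_to_g = dot(diff_g, diff_g)                 # ||f - g||^2
```

The residual is orthogonal to both basis vectors, projecting twice changes nothing, and `dist_to_g` is no smaller than `dist_to_pf`, in line with properties (i) through (iii).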


An Estimation Problem 


First, the technical setting: Let $(\Omega, \mathscr{F}, \mu)$ and $(\Theta, \mathscr{E}, \pi)$ be a $\sigma$-finite space and a probability space, and assume that $\mathscr{F}$ and $\mathscr{E}$ are countably generated. Let $f_\theta(\omega)$ be a nonnegative function on $\Theta \times \Omega$, measurable $\mathscr{E} \times \mathscr{F}$, and assume that $\int_\Omega f_\theta(\omega)\,\mu(d\omega) = 1$ for each $\theta \in \Theta$. For some unknown value of $\theta$, $\omega$ is drawn from $\Omega$ according to the probabilities $P_\theta(A) = \int_A f_\theta(\omega)\,\mu(d\omega)$, and the statistical problem is to estimate the value of $g(\theta)$, where $g$ is a real function on $\Theta$. The statistician knows the functions $f_\cdot(\cdot)$ and $g(\cdot)$, as well as the value of $\omega$; it is the value of $\theta$ that is unknown.

For an example, take $\Omega$ to be the line, $f(\omega)$ a function known to the statistician, and $f_\theta(\omega) = \alpha f(\alpha\omega + \beta)$, where $\theta = (\alpha, \beta)$ specifies unknown scale and location parameters; the problem is to estimate $g(\theta) = \alpha$, say. Or, more simply, as in the exponential case (14.7), take $f_\theta(\omega) = \alpha f(\alpha\omega)$, where $\theta = g(\theta) = \alpha$.

An estimator of $g(\theta)$ is a function $t(\omega)$. It is unbiased if

(19.15)  $\int_\Omega t(\omega) f_\theta(\omega)\,\mu(d\omega) = g(\theta)$

for all $\theta$ in $\Theta$ (assume the integral exists); this condition means that the estimate is on target in an average sense. A natural loss function is $(t(\omega) - g(\theta))^2$, and if $f_\theta$ is the correct density, the risk is taken to be $\int_\Omega (t(\omega) - g(\theta))^2 f_\theta(\omega)\,\mu(d\omega)$.

If the probability measure $\pi$ is regarded as an a priori distribution for the unknown $\theta$, the Bayes risk of $t$ is

(19.16)  $R(\pi, t) = \int_\Theta \int_\Omega (t(\omega) - g(\theta))^2 f_\theta(\omega)\,\mu(d\omega)\,\pi(d\theta);$

this integral, assumed finite, can be viewed as a joint integral or as an iterated integral (Fubini's theorem). And now $t_0$ is a Bayes estimator of $g$ with respect to $\pi$ if it minimizes $R(\pi, t)$ over $t$. This is analogous to the Bayes rules discussed earlier. The following simple projection argument shows that, except in trivial cases, no Bayes estimator is unbiased.

Let $Q$ be the probability measure on $\mathscr{E} \times \mathscr{F}$ having density $f_\theta(\omega)$ with respect to $\pi \times \mu$, and let $L^2$ be the space of square-integrable functions on $(\Theta \times \Omega, \mathscr{E} \times \mathscr{F}, Q)$. Then $Q$ is finite and $\mathscr{E} \times \mathscr{F}$ is countably generated. Recall that an element of $L^2$ is an equivalence class of functions that are equal almost everywhere with respect to $Q$. Let $G$ be the class of elements of $L^2$ containing a function of the form $\bar g(\theta, \omega) = g(\theta)$, that is, functions of $\theta$ alone. Then $G$ is a subspace. (That $G$ is algebraically closed is clear; if $f_n \in G$ and $\|f_n - f\| \to 0$, then (see the proof of Theorem 19.1) some subsequence converges to $f$ outside a set of $Q$-measure 0, and it follows easily that $f \in G$.) Similarly, let $T$ be the subspace of functions of $\omega$ alone: $\bar t(\theta, \omega) = t(\omega)$. Consider only functions $g$ and their estimators $t$ for which the corresponding $\bar g$ and $\bar t$ are in $L^2$.

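Suppose now that $t_0$ is both an unbiased estimator of $g_0$ and a Bayes estimator of $g_0$ with respect to $\pi$. By (19.16) for $g_0$, $R(\pi, t) = \|\bar t - \bar g_0\|^2$, and since $t_0$ is a Bayes estimator of $g_0$, it follows that $\|\bar t_0 - \bar g_0\|^2 \le \|\bar t - \bar g_0\|^2$ for all $\bar t$ in $T$. This means that $\bar t_0$ is the orthogonal projection of $\bar g_0$ on the subspace $T$ and hence that $\bar g_0 - \bar t_0 \perp \bar t_0$. On the other hand, from the assumption that $t_0$ is an unbiased estimator of $g_0$, it follows that, for every $\bar g(\theta, \omega) = g(\theta)$ in $G$,

$(\bar t_0 - \bar g_0, \bar g) = \int_\Theta \int_\Omega (t_0(\omega) - g_0(\theta))\, g(\theta) f_\theta(\omega)\,\mu(d\omega)\,\pi(d\theta)$

$\qquad = \int_\Theta g(\theta) \Bigl[\int_\Omega (t_0(\omega) - g_0(\theta)) f_\theta(\omega)\,\mu(d\omega)\Bigr] \pi(d\theta) = 0.$

This means that $\bar t_0 - \bar g_0 \perp G$: $\bar g_0$ is the orthogonal projection of $\bar t_0$ on the subspace $G$. But $\bar g_0 - \bar t_0 \perp \bar t_0$ and $\bar t_0 - \bar g_0 \perp \bar g_0$ together imply that $\bar t_0 - \bar g_0$ is orthogonal to itself: $\bar t_0 = \bar g_0$. Therefore, $t_0(\omega) = \bar t_0(\theta, \omega) = \bar g_0(\theta, \omega) = g_0(\theta)$ for $(\theta, \omega)$ outside a set of $Q$-measure 0.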

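This implies that $t_0$ and $g_0$ are essentially constant. Suppose for simplicity that $f_\theta(\omega) > 0$ for all $(\theta, \omega)$, so that (Theorem 15.2) $(\pi \times \mu)[(\theta, \omega): t_0(\omega) \ne g_0(\theta)] = 0$. By Fubini's theorem, there is a $\theta$ such that, if $a = g_0(\theta)$, then $\mu[\omega: t_0(\omega) \ne a] = 0$; and there is an $\omega$ such that, if $b = t_0(\omega)$, then $\pi[\theta: g_0(\theta) \ne b] = 0$. It follows that, for $(\theta, \omega)$ outside a set of $(\pi \times \mu)$-measure 0, $t_0(\omega)$ and $g_0(\theta)$ have the common value $a = b$: $\pi[\theta: g_0(\theta) = a] = 1$ and $P_\theta[\omega: t_0(\omega) = a] = 1$ for all $\theta$ in $\Theta$.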


PROBLEMS

19.1. Suppose that $\mu(\Omega) < \infty$ and $f \in L^\infty$. Show that $\|f\|_p \uparrow \|f\|_\infty$.

19.2. (a) Show that $L^\infty((0, 1], \mathscr{B}, \lambda)$ is not separable.
(b) Show that $L^p((0, 1], \mathscr{B}, \mu)$ is not separable if $\mu$ is counting measure ($\mu$ is not $\sigma$-finite).
(c) Show that $L^p(\Omega, \mathscr{F}, P)$ is not separable if (Theorem 36.2) there is on the space an independent stochastic process $[X_t: 0 \le t \le 1]$ such that $X_t$ takes the values $\pm 1$ with probability $\frac{1}{2}$ each ($\mathscr{F}$ is not countably generated).


This is interesting because of the close connection between Bayes rules and admissibility; see 
BERGER, pp. 546 ff. 




19.3. Show that Theorem 19.3 fails for $L^\infty((0, 1], \mathscr{B}, \lambda)$. Hint: Take $\gamma(f)$ to be a Banach limit of $n \int_0^{1/n} f(x)\,dx$.

19.4. Consider weak convergence in $L^p((0, 1], \mathscr{B}, \lambda)$.
(a) For the case $p = \infty$, find functions $f_n$ and $f$ such that $f_n$ goes weakly to $f$ but $\|f - f_n\|_p$ does not go to 0.
(b) Do the same for $p = 2$.

19.5. Show that the unit ball in $L^1((0, 1], \mathscr{B}, \lambda)$ is not weakly compact.

19.6. Show that a Bayes rule corresponding to $p = (p_1, \ldots, p_k)$ may not be admissible if $p_i = 0$ for some $i$. But there will be a better Bayes rule that is admissible.

19.7. The Neyman-Pearson lemma. Suppose $f_1$ and $f_2$ are rival densities and $L(j|i)$ is 0 or 1 as $j = i$ or $j \ne i$, so that $R_i(\delta)$ is the probability of choosing the opposite density when $f_i$ is the right one. Suppose of $\delta$ that $\delta_2(\omega) = 1$ if $f_2(\omega) > t f_1(\omega)$ and $\delta_2(\omega) = 0$ if $f_2(\omega) < t f_1(\omega)$, where $t > 0$. Show that $\delta$ is admissible: For any rule $\delta'$, $\int \delta'_2 f_1\,d\mu < \int \delta_2 f_1\,d\mu$ implies $\int \delta'_1 f_2\,d\mu > \int \delta_1 f_2\,d\mu$. Hint: $\int (\delta_2 - \delta'_2)(f_2 - t f_1)\,d\mu \ge 0$, since the integrand is nonnegative.

19.8. The classical orthonormal basis for $L^2[0, 2\pi]$ with Lebesgue measure is the trigonometric system

(19.17)  $(2\pi)^{-1/2}, \quad \pi^{-1/2} \sin nx, \quad \pi^{-1/2} \cos nx, \quad n = 1, 2, \ldots.$

Prove orthonormality. Hint: Express the sines and cosines in terms of $e^{inx} \pm e^{-inx}$, multiply out the products, and use the fact that $\int_0^{2\pi} e^{imx}\,dx$ is $2\pi$ or 0 as $m = 0$ or $m \ne 0$. (For the completeness of the trigonometric system, see Problem 26.26.)

19.9. Drop the assumption that $L^2$ is separable. Order by inclusion the orthonormal systems in $L^2$, and let (Zorn's lemma) $\Phi = [\varphi_\gamma: \gamma \in \Gamma]$ be maximal.
(a) Show that $\Gamma_f = [\gamma: (f, \varphi_\gamma) \ne 0]$ is countable. Hint: Use $\sum_{i=1}^n (f, \varphi_{\gamma_i})^2 \le \|f\|^2$ and the argument for Theorem 10.2(iv).
(b) Let $Pf = \sum_{\gamma \in \Gamma_f} (f, \varphi_\gamma)\varphi_\gamma$. Show that $f - Pf \perp \Phi$ and hence (maximality) $f = Pf$. Thus $\Phi$ is an orthonormal basis.
(c) Show that $\Phi$ is countable if and only if $L^2$ is separable.
(d) Now take $\Phi$ to be a maximal orthonormal system in a subspace $M$, and define $P_M f = \sum_{\gamma \in \Gamma_f} (f, \varphi_\gamma)\varphi_\gamma$. Show that $P_M f \in M$ and $f - P_M f \perp \Phi$, that $g = P_M g$ if $g \in M$, and that $f - P_M f \perp M$. This defines the general orthogonal projection.


CHAPTER 4 


Random Variables 
and Expected Values 


SECTION 20. RANDOM VARIABLES AND DISTRIBUTIONS 


This section and the next cover random variables and the machinery for 
dealing with them—expected values, distributions, moment generating func- 
tions, independence, convolution. 


Random Variables and Vectors 


A random variable on a probability space $(\Omega, \mathscr{F}, P)$ is a real-valued function $X = X(\omega)$ measurable $\mathscr{F}$. Sections 5 through 9 dealt with random variables of a special kind, namely simple random variables, those with finite range. All concepts and facts concerning real measurable functions carry over to random variables; any changes are matters of viewpoint, notation, and terminology only.

The positive and negative parts $X^+$ and $X^-$ of $X$ are defined as in (15.4) and (15.5). Theorem 13.5 also applies: Define

(20.1)  $\psi_n(x) = \begin{cases} (k-1)2^{-n} & \text{if } (k-1)2^{-n} \le x < k2^{-n},\ 1 \le k \le n2^n, \\ n & \text{if } x \ge n. \end{cases}$

If $X$ is nonnegative and $X_n = \psi_n(X)$, then $0 \le X_n \uparrow X$. If $X$ is not necessarily nonnegative, define

(20.2)  $X_n = \begin{cases} \psi_n(X) & \text{if } X \ge 0, \\ -\psi_n(-X) & \text{if } X \le 0. \end{cases}$

(This is the same as (13.6).) Then $0 \le X_n(\omega) \uparrow X(\omega)$ if $X(\omega) \ge 0$ and $0 \ge X_n(\omega) \downarrow X(\omega)$ if $X(\omega) \le 0$; and $|X_n(\omega)| \uparrow |X(\omega)|$ for every $\omega$. The random variable $X_n$ is in each case simple.
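The dyadic approximation is easy to compute for a single real value. The sketch below evaluates $\psi_n$ and the signed version (20.2) on plain numbers rather than on a random variable; the function names `psi` and `approximate` are hypothetical.

```python
import math


def psi(n, x):
    """psi_n(x) from (20.1) for x >= 0: round down to a multiple of 2^-n, cap at n."""
    if x >= n:
        return float(n)
    return math.floor(x * 2 ** n) / 2 ** n


def approximate(n, x):
    """X_n from (20.2): psi_n(x) for x >= 0 and -psi_n(-x) for x < 0."""
    return psi(n, x) if x >= 0 else -psi(n, -x)


# Successive approximations to a positive and a negative value.
values_up = [approximate(n, math.pi) for n in range(1, 12)]
values_down = [approximate(n, -0.8) for n in range(1, 12)]
```

The list `values_up` increases toward $\pi$ and `values_down` decreases toward $-0.8$, and each approximation takes only finitely many values, which is why $X_n$ is simple.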

A random vector is a mapping from $\Omega$ to $R^k$ that is measurable $\mathscr{F}$. Any mapping from $\Omega$ to $R^k$ must have the form $\omega \to X(\omega) = (X_1(\omega), \ldots, X_k(\omega))$, where each $X_i(\omega)$ is real; as shown in Section 13 (see (13.2)), $X$ is measurable if and only if each $X_i$ is. Thus a random vector is simply a $k$-tuple $X = (X_1, \ldots, X_k)$ of random variables.


Subfields 


If $\mathscr{G}$ is a $\sigma$-field for which $\mathscr{G} \subset \mathscr{F}$, a $k$-dimensional random vector $X$ is of course measurable $\mathscr{G}$ if $[\omega: X(\omega) \in H] \in \mathscr{G}$ for every $H$ in $\mathscr{R}^k$. The $\sigma$-field $\sigma(X)$ generated by $X$ is the smallest $\sigma$-field with respect to which it is measurable. The $\sigma$-field generated by a collection of random vectors is the smallest $\sigma$-field with respect to which each one is measurable.

As explained in Sections 4 and 5, a sub-$\sigma$-field corresponds to partial information about $\omega$. The information contained in $\sigma(X) = \sigma(X_1, \ldots, X_k)$ consists of the $k$ numbers $X_1(\omega), \ldots, X_k(\omega)$.† The following theorem is the analogue of Theorem 5.1, but there are technical complications in its proof.


Theorem 20.1. Let $X = (X_1, \ldots, X_k)$ be a random vector.

(i) The $\sigma$-field $\sigma(X) = \sigma(X_1, \ldots, X_k)$ consists exactly of the sets $[X \in H]$ for $H \in \mathscr{R}^k$.
(ii) In order that a random variable $Y$ be measurable $\sigma(X) = \sigma(X_1, \ldots, X_k)$ it is necessary and sufficient that there exist a measurable map $f: R^k \to R^1$ such that $Y(\omega) = f(X_1(\omega), \ldots, X_k(\omega))$ for all $\omega$.


Proof. The class $\mathscr{G}$ of sets of the form $[X \in H]$ for $H \in \mathscr{R}^k$ is a $\sigma$-field. Since $X$ is measurable $\sigma(X)$, $\mathscr{G} \subset \sigma(X)$. Since $X$ is measurable $\mathscr{G}$, $\sigma(X) \subset \mathscr{G}$. Hence part (i).

Measurability of $f$ in part (ii) refers of course to measurability $\mathscr{R}^k/\mathscr{R}^1$. The sufficiency is easy: if such an $f$ exists, Theorem 13.1(ii) implies that $Y$ is measurable $\sigma(X)$.

To prove necessity,‡ suppose at first that $Y$ is a simple random variable, and let $y_1, \ldots, y_m$ be its different possible values. Since $A_i = [\omega: Y(\omega) = y_i]$ lies in $\sigma(X)$, it must by part (i) have the form $[\omega: X(\omega) \in H_i]$ for some $H_i$ in $\mathscr{R}^k$. Put $f = \sum_i y_i I_{H_i}$; certainly $f$ is measurable. Since the $A_i$ are disjoint, no $X(\omega)$ can lie in more than one $H_i$ (even though the latter need not be disjoint), and hence $f(X(\omega)) = Y(\omega)$.


†The partition defined by (4.16) consists of the sets $[\omega: X(\omega) = x]$ for $x \in R^k$.
‡For a general version of this argument, see Problem 13.3.


To treat the general case, consider simple random variables $Y_n$ such that $Y_n(\omega) \to Y(\omega)$ for each $\omega$. For each $n$, there is a measurable function $f_n: R^k \to R^1$ such that $Y_n(\omega) = f_n(X(\omega))$ for all $\omega$. Let $M$ be the set of $x$ in $R^k$ for which $\{f_n(x)\}$ converges; by Theorem 13.4(iii), $M$ lies in $\mathscr{R}^k$. Let $f(x) = \lim_n f_n(x)$ for $x$ in $M$, and let $f(x) = 0$ for $x$ in $R^k - M$. Since $f = \lim_n f_n I_M$ and $f_n I_M$ is measurable, $f$ is measurable by Theorem 13.4(ii). For each $\omega$, $Y(\omega) = \lim_n f_n(X(\omega))$; this implies in the first place that $X(\omega)$ lies in $M$ and in the second place that $Y(\omega) = \lim_n f_n(X(\omega)) = f(X(\omega))$. ∎


Distributions 


The distribution or law of a random variable $X$ was in Section 14 defined as the probability measure on the line given by $\mu = PX^{-1}$ (see (13.7)), or

(20.3)  $\mu(A) = P[X \in A], \quad A \in \mathscr{R}^1.$

The distribution function of $X$ was defined by

(20.4)  $F(x) = \mu(-\infty, x] = P[X \le x]$

for real $x$. The left-hand limit of $F$ satisfies

(20.5)  $F(x-) = \mu(-\infty, x) = P[X < x], \qquad F(x) - F(x-) = \mu\{x\} = P[X = x],$

and $F$ has at most countably many discontinuities. Further, $F$ is nondecreasing and right-continuous, and $\lim_{x \to -\infty} F(x) = 0$, $\lim_{x \to \infty} F(x) = 1$. By Theorem 14.1, for each $F$ with these properties there exists on some probability space a random variable having $F$ as its distribution function.

A support for $\mu$ is a Borel set $S$ for which $\mu(S) = 1$. A random variable, its distribution, and its distribution function are discrete if $\mu$ has a countable support $S = \{x_1, x_2, \ldots\}$. In this case $\mu$ is completely determined by the values $\mu\{x_1\}, \mu\{x_2\}, \ldots$.

A familiar discrete distribution is the binomial:

(20.6)  $P[X = r] = \mu\{r\} = \binom{n}{r} p^r (1-p)^{n-r}, \quad r = 0, 1, \ldots, n.$

There are many random variables, on many spaces, with this distribution: If $\{X_k\}$ is an independent sequence such that $P[X_k = 1] = p$ and $P[X_k = 0] = 1 - p$ (see Theorem 5.3), then $X$ could be $\sum_{i=1}^n X_i$, or $\sum_{i=9}^{8+n} X_i$, or the sum of any $n$ of the $X_i$. Or $\Omega$ could be $\{0, 1, \ldots, n\}$ if $\mathscr{F}$ consists of all subsets, $P\{r\} = \mu\{r\}$, $r = 0, 1, \ldots, n$, and $X(r) = r$. Or again the space and random variable could be those given by the construction in either of the two proofs of Theorem 14.1. These examples show that, although the distribution of a random variable $X$ contains all the information about the probabilistic behavior of $X$ itself, it contains beyond this no further information about the underlying probability space $(\Omega, \mathscr{F}, P)$ or about the interaction of $X$ with other random variables on the space.
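That a sum of $n$ independent 0-1 variables has the binomial distribution (20.6) can be confirmed numerically. The sketch below computes the probabilities in two ways, directly from the formula and by repeated convolution; the function names and the values $n = 5$, $p = 0.3$ are illustrative assumptions.

```python
import math


def binomial_pmf(n, p):
    """List of P[X = r] = C(n, r) p^r (1 - p)^(n - r) for r = 0, ..., n, as in (20.6)."""
    return [math.comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)]


def bernoulli_sum_pmf(n, p):
    """Distribution of X_1 + ... + X_n for independent 0-1 variables, by convolution."""
    pmf = [1.0]                           # distribution of the empty sum
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + 1)
        for r, mass in enumerate(pmf):
            nxt[r] += mass * (1 - p)      # next variable equals 0
            nxt[r + 1] += mass * p        # next variable equals 1
        pmf = nxt
    return pmf


direct = binomial_pmf(5, 0.3)
summed = bernoulli_sum_pmf(5, 0.3)
```

The two lists agree up to rounding, although they come from different descriptions of the random variable; only the distribution is shared.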

Another common discrete distribution is the Poisson distribution with parameter $\lambda > 0$:

(20.7)  $P[X = r] = \mu\{r\} = e^{-\lambda} \frac{\lambda^r}{r!}, \quad r = 0, 1, \ldots.$

A constant $c$ can be regarded as a discrete random variable with $X(\omega) \equiv c$. In this case $P[X = c] = \mu\{c\} = 1$. For an artificial discrete example, let $\{x_1, x_2, \ldots\}$ be an enumeration of the rationals, and put

(20.8)  $\mu\{x_r\} = 2^{-r};$

the point of the example is that the support need not be contained in a lattice.

A random variable and its distribution have density $f$ with respect to Lebesgue measure if $f$ is a nonnegative Borel function on $R^1$ and

(20.9)  $P[X \in A] = \mu(A) = \int_A f(x)\,dx, \quad A \in \mathscr{R}^1.$

In other words, the requirement is that $\mu$ have density $f$ with respect to Lebesgue measure $\lambda$ in the sense of (16.11). The density is assumed to be with respect to $\lambda$ if no other measure is specified.

Taking $A = R^1$ in (20.9) shows that $f$ must integrate to 1. Note that $f$ is determined only to within a set of Lebesgue measure 0: if $f = g$ except on a set of Lebesgue measure 0, then $g$ can also serve as a density for $X$ and $\mu$.

It follows by Theorem 3.3 that (20.9) holds for every Borel set $A$ if it holds for every interval, that is, if

$F(b) - F(a) = \int_a^b f(x)\,dx$

holds for every $a$ and $b$. Note that $F$ need not differentiate to $f$ everywhere (see (20.13), for example); all that is required is that $f$ integrate properly, that is, that (20.9) hold. On the other hand, if $F$ does differentiate to $f$ and $f$ is continuous, it follows by the fundamental theorem of calculus that $f$ is indeed a density for $F$.†


†The general question of the relation between differentiation and integration is taken up in Section 31.




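For the exponential distribution with parameter $\alpha > 0$, the density is

(20.10)  $f(x) = \begin{cases} 0 & \text{if } x < 0, \\ \alpha e^{-\alpha x} & \text{if } x \ge 0. \end{cases}$

The corresponding distribution function

(20.11)  $F(x) = \begin{cases} 0 & \text{if } x \le 0, \\ 1 - e^{-\alpha x} & \text{if } x > 0 \end{cases}$

was studied in Section 14.
For the normal distribution with parameters $m$ and $\sigma$, $\sigma > 0$,

(20.12)  $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl[-\frac{(x - m)^2}{2\sigma^2}\Bigr], \quad -\infty < x < \infty;$

a change of variable together with (18.10) shows that $f$ does integrate to 1. For the standard normal distribution, $m = 0$ and $\sigma = 1$.
For the uniform distribution over an interval $(a, b]$,

(20.13)  $f(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a < x \le b, \\ 0 & \text{otherwise.} \end{cases}$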

The distribution function $F$ is useful if it has a simple expression, as in (20.11). It is ordinarily simpler to describe $\mu$ by means of a density $f(x)$ or discrete probabilities $\mu\{x_r\}$.

If $F$ comes from a density, it is continuous. In the discrete case, $F$ increases in jumps; the example (20.8), in which the points of discontinuity are dense, shows that it may nonetheless be very irregular. There exist distributions that are not discrete but are not continuous either. An example is $\mu(A) = \frac{1}{2}\mu_1(A) + \frac{1}{2}\mu_2(A)$ for $\mu_1$ discrete and $\mu_2$ coming from a density; such mixed cases arise, but they are few. Section 31 has examples of a more interesting kind, namely functions $F$ that are continuous but do not come from any density. These are the functions singular in the sense of Lebesgue; the $Q(x)$ describing bold play in gambling (see (7.33)) turns out to be one of them. See Example 31.1.

If $X$ has distribution $\mu$ and $g$ is a real function of a real variable,

(20.14)  $P[g(X) \in A] = P[X \in g^{-1}A] = \mu(g^{-1}A).$

Thus the distribution of $g(X)$ is $\mu g^{-1}$ in the notation (13.7).


In the case where there is a density, $f$ and $F$ are related by

(20.15)  $F(x) = \int_{-\infty}^x f(t)\,dt.$

Hence $f$ at its continuity points must be the derivative of $F$. As noted above, if $F$ has a continuous derivative, this derivative can serve as the density $f$. Suppose that $f$ is continuous and $g$ is increasing, and let $T = g^{-1}$. The distribution function of $g(X)$ is $P[g(X) \le x] = P[X \le T(x)] = F(T(x))$. If $T$ is differentiable, this differentiates to $f(T(x))T'(x)$, which is therefore the density for $g(X)$. If $g$ is decreasing, on the other hand, then $P[g(X) \le x] = P[X \ge T(x)] = 1 - F(T(x))$, and the derivative is equal to $-f(T(x))T'(x) = f(T(x))|T'(x)|$. In either case, $g(X)$ has density

(20.16)  $\frac{d}{dx} P[g(X) \le x] = f(T(x))|T'(x)|.$


If $X$ has the normal density (20.12) and $a > 0$, (20.16) shows that $aX + b$ has the normal density with parameters $am + b$ and $a\sigma$. Finding the density of $g(X)$ from first principles, as in the argument leading to (20.16), often works even if $g$ is many-to-one:


Example 20.1. If $X$ has the standard normal distribution, then

$P[X^2 \le x] = \frac{1}{\sqrt{2\pi}} \int_{-\sqrt{x}}^{\sqrt{x}} e^{-t^2/2}\,dt = \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{x}} e^{-t^2/2}\,dt$

for $x > 0$. Hence $X^2$ has density

$f(x) = \begin{cases} 0 & \text{if } x \le 0, \\ \dfrac{1}{\sqrt{2\pi}}\, x^{-1/2} e^{-x/2} & \text{if } x > 0. \end{cases}$ ∎
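The density in Example 20.1 can be checked against the distribution function numerically. The sketch below uses the identity $P[X^2 \le x] = \operatorname{erf}(\sqrt{x/2})$ for standard normal $X$ and compares a central-difference derivative with the stated density; the function names and the step size `h` are hypothetical choices.

```python
import math


def cdf_square(x):
    """P[X^2 <= x] for standard normal X: 2*Phi(sqrt(x)) - 1 = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0


def density_square(x):
    """Density of X^2 from Example 20.1."""
    if x <= 0:
        return 0.0
    return x ** -0.5 * math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)


def numerical_derivative(func, x, h=1e-5):
    """Central difference approximation to func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


# Triples (x, numerical derivative of the distribution function, stated density).
checks = [
    (x, numerical_derivative(cdf_square, x), density_square(x))
    for x in (0.5, 1.0, 2.0, 4.0)
]
```

At each test point the numerical derivative and the formula agree to several decimal places, as they should away from the singularity of the density at 0.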


Multidimensional Distributions 

For a $k$-dimensional random vector $X = (X_1, \ldots, X_k)$, the distribution $\mu$ (a probability measure on $\mathscr{R}^k$) and the distribution function $F$ (a real function on $R^k$) are defined by

(20.17)  $\mu(A) = P[(X_1, \ldots, X_k) \in A], \quad A \in \mathscr{R}^k,$
         $F(x_1, \ldots, x_k) = P[X_1 \le x_1, \ldots, X_k \le x_k] = \mu(S_x),$

where $S_x = [y: y_i \le x_i,\ i = 1, \ldots, k]$ consists of the points "southwest" of $x$.


Often $\mu$ and $F$ are called the joint distribution and joint distribution function of $X_1, \ldots, X_k$.
Now $F$ is nondecreasing in each variable, and $\Delta_A F \ge 0$ for bounded rectangles $A$ (see (12.12)). As $h$ decreases to 0, the set

$S_{x,h} = [y: y_i \le x_i + h,\ i = 1, \ldots, k]$

decreases to $S_x$, and therefore (Theorem 2.1(ii)) $F$ is continuous from above in the sense that $\lim_{h \downarrow 0} F(x_1 + h, \ldots, x_k + h) = F(x_1, \ldots, x_k)$. Further, $F(x_1, \ldots, x_k) \to 0$ if $x_i \to -\infty$ for some $i$ (the other coordinates held fixed), and $F(x_1, \ldots, x_k) \to 1$ if $x_i \to \infty$ for each $i$. For any $F$ with these properties there is by Theorem 12.5 a unique probability measure $\mu$ on $\mathscr{R}^k$ such that $\mu(A) = \Delta_A F$ for bounded rectangles $A$, and $\mu(S_x) = F(x)$ for all $x$.

As $h$ decreases to 0, $S_{x,-h}$ increases to the interior $S_x^\circ = [y: y_i < x_i,\ i = 1, \ldots, k]$ of $S_x$, and so

(20.18)  $\lim_{h \downarrow 0} F(x_1 - h, \ldots, x_k - h) = \mu(S_x^\circ).$

Since $F$ is nondecreasing in each variable, it is continuous at $x$ if and only if it is continuous from below there in the sense that this last limit coincides with $F(x)$. Thus $F$ is continuous at $x$ if and only if $F(x) = \mu(S_x) = \mu(S_x^\circ)$, which holds if and only if the boundary $\partial S_x = S_x - S_x^\circ$ (the $y$-set where $y_i \le x_i$ for all $i$ and $y_i = x_i$ for some $i$) satisfies $\mu(\partial S_x) = 0$. If $k > 1$, $F$ can have discontinuity points even if $\mu$ has no point masses: if $\mu$ corresponds to a uniform distribution of mass over the segment $B = [(x, 0): 0 < x < 1]$ in the plane ($\mu(A) = \lambda[x: 0 < x < 1,\ (x, 0) \in A]$), then $F$ is discontinuous at each point of $B$. This also shows that $F$ can be discontinuous at uncountably many points. On the other hand, for fixed $x$ the boundaries $\partial S_{x,h}$ are disjoint for different values of $h$, and so (Theorem 10.2(iv)) only countably many of them can have positive $\mu$-measure. Thus $x$ is the limit of points $(x_1 + h, \ldots, x_k + h)$ at which $F$ is continuous: the continuity points of $F$ are dense.

There is always a random vector having a given distribution and distribution function: Take $(\Omega, \mathscr{F}, P) = (R^k, \mathscr{R}^k, \mu)$ and $X(\omega) = \omega$. This is the obvious extension of the construction in the first proof of Theorem 14.1.

The distribution may as for the line be discrete in the sense of having countable support. It may have density $f$ with respect to $k$-dimensional Lebesgue measure: $\mu(A) = \int_A f(x)\,dx$. As in the case $k = 1$, the distribution $\mu$ is more fundamental than the distribution function $F$, and usually $\mu$ is described not by $F$ but by a density or by discrete probabilities.

If $X$ is a $k$-dimensional random vector and $g: R^k \to R^i$ is measurable, then $g(X)$ is an $i$-dimensional random vector; if the distribution of $X$ is $\mu$, the distribution of $g(X)$ is $\mu g^{-1}$, just as in the case $k = 1$ (see (20.14)). If $g_j: R^k \to R^1$ is defined by $g_j(x_1, \ldots, x_k) = x_j$, then $g_j(X)$ is $X_j$, and its distribution $\mu_j = \mu g_j^{-1}$ is given by $\mu_j(A) = \mu[(x_1, \ldots, x_k): x_j \in A] = P[X_j \in A]$ for $A \in \mathscr{R}^1$. The $\mu_j$ are the marginal distributions of $\mu$. If $\mu$ has a density $f$ in $R^k$, then $\mu_j$ has over the line the density

(20.19)  $f_j(x) = \int_{R^{k-1}} f(x_1, \ldots, x_{j-1}, x, x_{j+1}, \ldots, x_k)\,dx_1 \cdots dx_{j-1}\,dx_{j+1} \cdots dx_k,$

since by Fubini's theorem the right side integrated over $A$ comes to $\mu[(x_1, \ldots, x_k): x_j \in A]$.


Now suppose that $g$ is a one-to-one, continuously differentiable map of $V$ onto $U$, where $U$ and $V$ are open sets in $R^k$. Let $T$ be the inverse, and suppose its Jacobian $J(x)$ never vanishes. If $X$ has a density $f$ supported by $V$, then for $A \subset U$, $P[g(X) \in A] = P[X \in TA] = \int_{TA} f(y)\,dy$, and by (17.10), this equals $\int_A f(Tx)|J(x)|\,dx$. Therefore, $g(X)$ has density

(20.20)  $d(x) = \begin{cases} f(Tx)|J(x)| & \text{for } x \in U, \\ 0 & \text{for } x \notin U. \end{cases}$

This is the analogue of (20.16).
Example 20.2. Suppose that $(X_1, X_2)$ has density

$f(x_1, x_2) = (2\pi)^{-1} \exp\bigl[-\tfrac{1}{2}(x_1^2 + x_2^2)\bigr],$

and let $g$ be the transformation to polar coordinates. Then $U$, $V$, and $T$ are as in Example 17.7. If $R$ and $\Theta$ are the polar coordinates of $(X_1, X_2)$, then $(R, \Theta) = g(X_1, X_2)$ has density $(2\pi)^{-1}\rho e^{-\rho^2/2}$ in $U$. By (20.19), $R$ has density $\rho e^{-\rho^2/2}$ on $(0, \infty)$, and $\Theta$ is uniformly distributed over $(0, 2\pi)$. ∎
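The conclusion of Example 20.2 can be probed by simulation. The sketch below draws pairs of independent standard normal variables, converts them to polar coordinates, and compares two empirical frequencies with the values implied by the stated densities. The seed, the sample size, and the name `polar_sample` are arbitrary assumptions, and the comparison is statistical rather than exact.

```python
import math
import random


def polar_sample(rng):
    """Draw (X_1, X_2) independent standard normal and return (R, Theta)."""
    x1 = rng.gauss(0.0, 1.0)
    x2 = rng.gauss(0.0, 1.0)
    radius = math.hypot(x1, x2)
    angle = math.atan2(x2, x1) % (2.0 * math.pi)   # map (-pi, pi] to [0, 2*pi)
    return radius, angle


rng = random.Random(12345)          # fixed seed so the run is reproducible
n = 20000
samples = [polar_sample(rng) for _ in range(n)]

# Empirical frequencies compared with the claimed distributions.
freq_radius = sum(r <= 1.0 for r, _ in samples) / n
freq_angle = sum(t <= math.pi for _, t in samples) / n
exact_radius = 1.0 - math.exp(-0.5)   # integral of rho * exp(-rho^2 / 2) over (0, 1]
exact_angle = 0.5                     # uniform on (0, 2*pi)
```

With this many samples the two frequencies should fall within a few hundredths of `exact_radius` and `exact_angle`.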


For the normal distribution in $R^k$, see Section 29.


Independence 
Random variables $X_1, \ldots, X_k$ are defined to be independent if the $\sigma$-fields $\sigma(X_1), \ldots, \sigma(X_k)$ they generate are independent in the sense of Section 4. This concept for simple random variables was studied extensively in Chapter 1; the general case was touched on in Section 14. Since $\sigma(X_i)$ consists of the sets $[X_i \in H]$ for $H \in \mathscr{R}^1$, $X_1, \ldots, X_k$ are independent if and only if

(20.21)  $P[X_1 \in H_1, \ldots, X_k \in H_k] = P[X_1 \in H_1] \cdots P[X_k \in H_k]$

for all linear Borel sets $H_1, \ldots, H_k$. The definition (4.10) of independence requires that (20.21) hold also if some of the events $[X_i \in H_i]$ are suppressed on each side, but this only means taking $H_i = R^1$.


Suppose that

(20.22)  $P[X_1 \le x_1, \ldots, X_k \le x_k] = P[X_1 \le x_1] \cdots P[X_k \le x_k]$

for all real $x_1, \ldots, x_k$; it then also holds if some of the events $[X_i \le x_i]$ are suppressed on each side (let $x_i \to \infty$). Since the intervals $(-\infty, x]$ form a $\pi$-system generating $\mathscr{R}^1$, the sets $[X_i \le x]$ form a $\pi$-system generating $\sigma(X_i)$. Therefore, by Theorem 4.2, (20.22) implies that $X_1, \ldots, X_k$ are independent. If, for example, the $X_i$ are integer-valued, it is enough that $P[X_1 = n_1, \ldots, X_k = n_k] = P[X_1 = n_1] \cdots P[X_k = n_k]$ for integral $n_1, \ldots, n_k$ (see (5.9)).

Let $(X_1, \ldots, X_k)$ have distribution $\mu$ and distribution function $F$, and let
the $X_i$ have distributions $\mu_i$ and distribution functions $F_i$ (the marginals). By
(20.21), $X_1, \ldots, X_k$ are independent if and only if $\mu$ is product measure in
the sense of Section 18:

(20.23)    $\mu = \mu_1 \times \cdots \times \mu_k.$

By (20.22), $X_1, \ldots, X_k$ are independent if and only if

(20.24)    $F(x_1, \ldots, x_k) = F_1(x_1) \cdots F_k(x_k).$

Suppose that each $\mu_i$ has density $f_i$; by Fubini's theorem, $f_1(y_1) \cdots f_k(y_k)$
integrated over $(-\infty, x_1] \times \cdots \times (-\infty, x_k]$ is just $F_1(x_1) \cdots F_k(x_k)$, so that
$\mu$ has density

(20.25)    $f(x) = f_1(x_1) \cdots f_k(x_k)$

in the case of independence.

If $\mathscr{G}_1, \ldots, \mathscr{G}_k$ are independent $\sigma$-fields and $X_i$ is measurable $\mathscr{G}_i$,
$i = 1, \ldots, k$, then certainly $X_1, \ldots, X_k$ are independent.

If $X_i$ is a $d_i$-dimensional random vector, $i = 1, \ldots, k$, then $X_1, \ldots, X_k$ are
by definition independent if the $\sigma$-fields $\sigma(X_1), \ldots, \sigma(X_k)$ are independent.
The theory is just as for random variables: $X_1, \ldots, X_k$ are independent if and
only if (20.21) holds for $H_1 \in \mathscr{R}^{d_1}, \ldots, H_k \in \mathscr{R}^{d_k}$. Now $(X_1, \ldots, X_k)$ can be
regarded as a random vector of dimension $d = \sum_{i=1}^{k} d_i$; if $\mu$ is its distribution
in $R^d = R^{d_1} \times \cdots \times R^{d_k}$ and $\mu_i$ is the distribution of $X_i$ in $R^{d_i}$, then, just as
before, $X_1, \ldots, X_k$ are independent if and only if $\mu = \mu_1 \times \cdots \times \mu_k$. In none
of this need the $d_i$ components of a single $X_i$ be themselves independent
random variables.

An infinite collection of random variables or random vectors is by definition
independent if each finite subcollection is. The argument following (5.10)
extends from collections of simple random variables to collections of random
vectors:


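Theorem 20.2. Suppose that

(20.26)    $\begin{matrix} X_{11}, & X_{12}, & \ldots \\ X_{21}, & X_{22}, & \ldots \\ \vdots & \vdots & \end{matrix}$

is an independent collection of random vectors. If $\mathscr{F}_i$ is the $\sigma$-field generated by
the $i$th row, then $\mathscr{F}_1, \mathscr{F}_2, \ldots$ are independent.

PROOF. Let $\mathscr{A}_i$ consist of the finite intersections of sets of the form
$[X_{ij} \in H]$ with $H$ a Borel set in a space of the appropriate dimension, and
apply Theorem 4.2. The $\sigma$-fields $\mathscr{F}_i = \sigma(\mathscr{A}_i)$, $i = 1, \ldots, n$, are independent
for each $n$, and the result follows. ■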


Each row of (20.26) may be finite or infinite, and there may be finitely or 
infinitely many rows. As a matter of fact, rows may be uncountable and there 
may be uncountably many of them. 

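Suppose that $X$ and $Y$ are independent random vectors with distributions
$\mu$ and $\nu$ in $R^j$ and $R^k$. Then $(X, Y)$ has distribution $\mu \times \nu$ in $R^j \times R^k = R^{j+k}$.
Let $x$ range over $R^j$ and $y$ over $R^k$. By Fubini's theorem,

(20.27)    $(\mu \times \nu)(B) = \int_{R^j} \nu[y\colon (x, y) \in B]\, \mu(dx), \qquad B \in \mathscr{R}^{j+k}.$

Replace $B$ by $(A \times R^k) \cap B$, where $A \in \mathscr{R}^j$ and $B \in \mathscr{R}^{j+k}$. Then (20.27)
reduces to

(20.28)    $(\mu \times \nu)((A \times R^k) \cap B) = \int_A \nu[y\colon (x, y) \in B]\, \mu(dx), \qquad A \in \mathscr{R}^j,\ B \in \mathscr{R}^{j+k}.$

If $B_x = [y\colon (x, y) \in B]$ is the $x$-section of $B$, so that $B_x \in \mathscr{R}^k$ (Theorem
18.1), then $P[(x, Y) \in B] = P[\omega\colon (x, Y(\omega)) \in B] = P[\omega\colon Y(\omega) \in B_x] = \nu(B_x)$.
Expressing the formulas in terms of the random vectors themselves gives this
result: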


Theorem 20.3. If $X$ and $Y$ are independent random vectors with distributions
$\mu$ and $\nu$ in $R^j$ and $R^k$, then

(20.29)    $P[(X, Y) \in B] = \int_{R^j} P[(x, Y) \in B]\, \mu(dx), \qquad B \in \mathscr{R}^{j+k},$

and

(20.30)    $P[X \in A, (X, Y) \in B] = \int_A P[(x, Y) \in B]\, \mu(dx), \qquad A \in \mathscr{R}^j,\ B \in \mathscr{R}^{j+k}.$


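Example 20.3. Suppose that $X$ and $Y$ are independent exponentially
distributed random variables with the same parameter $\alpha$. By (20.29),
$P[Y/X \ge z] = \int_0^\infty P[Y/x \ge z]\, \alpha e^{-\alpha x}\, dx = \int_0^\infty e^{-\alpha x z} \alpha e^{-\alpha x}\, dx = (1 + z)^{-1}$.
Thus $Y/X$ has density $(1 + z)^{-2}$ for $z \ge 0$. Since
$P[X \ge z_1, Y/X \ge z_2] = \int_{z_1}^\infty P[Y/x \ge z_2]\, \alpha e^{-\alpha x}\, dx$ by (20.30),
the joint distribution of $X$ and $Y/X$ can be calculated as well. ■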
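The tail formula $P[Y/X \ge z] = (1 + z)^{-1}$ can be compared with a Monte Carlo estimate. This is only a sketch: the function name `ratio_tail`, the rate `alpha = 1.5`, and the sample size are assumptions chosen for illustration, and the result should not depend on `alpha`.

```python
import random

def ratio_tail(z, alpha=1.5, n=100_000, seed=1):
    """Monte Carlo estimate of P[Y/X >= z] for independent exponentials with rate alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x = rng.expovariate(alpha)
        y = rng.expovariate(alpha)
        if y >= z * x:  # same event as Y/X >= z when x > 0, and avoids a division
            hits += 1
    return hits / n

if __name__ == "__main__":
    for z in (0.5, 1.0, 3.0):
        print(z, ratio_tail(z), 1 / (1 + z))
```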


The formulas (20.29) and (20.30) are constantly applied as in this example. 
There is no virtue in making an issue of each case, however, and the appeal 
to Theorem 20.3 is usually silent. 


Example 20.4. Here is a more complicated argument of the same sort. Let
$X_1, \ldots, X_n$ be independent random variables, each uniformly distributed over $[0, t]$.
Let $Y_k$ be the $k$th smallest among the $X_i$, so that $0 \le Y_1 \le \cdots \le Y_n \le t$. The $X_i$
divide $[0, t]$ into $n + 1$ subintervals of lengths $Y_1, Y_2 - Y_1, \ldots, Y_n - Y_{n-1}, t - Y_n$. Let $M$
be the maximum of these lengths. Define $\psi_n(t, a) = P[M \le a]$. The problem is to show
that

(20.31)    $\psi_n(t, a) = \sum_{k=0}^{n+1} (-1)^k \binom{n+1}{k} \left(1 - k\frac{a}{t}\right)_+^n,$

where $x_+ = (x + |x|)/2$ denotes positive part.
Separate consideration of the possibilities $0 \le a \le t/2$, $t/2 \le a \le t$, and $t \le a$
disposes of the case $n = 1$. Suppose it is shown that the probability $\psi_n(t, a)$ satisfies
the recursion

(20.32)    $\psi_n(t, a) = n \int_0^a \psi_{n-1}(t - x, a) \left(\frac{t - x}{t}\right)^{n-1} \frac{dx}{t}.$


Now (as follows by an integration together with Pascal’s identity for binomial 
coefficients) the right side of (20.31) satisfies this same recursion, and so it will follow 
by induction that (20.31) holds for all n. 

In intuitive form, the argument for (20.32) is this: If $[M \le a]$ is to hold, the smallest
of the $X_i$ must have some value $x$ in $[0, a]$. If $X_1$ is the smallest of the $X_i$, then
$X_2, \ldots, X_n$ must all lie in $[x, t]$ and divide it into subintervals of length at most $a$; the
probability of this is $(1 - x/t)^{n-1} \psi_{n-1}(t - x, a)$, because $X_2, \ldots, X_n$ have probability
$(1 - x/t)^{n-1}$ of all lying in $[x, t]$, and if they do, they are independent and uniformly
distributed there. Now (20.32) results from integrating with respect to the density for
$X_1$ and multiplying by $n$ to allow for the fact that any of $X_1, \ldots, X_n$ may be the
smallest.

To make this argument rigorous, apply (20.30) for $j = 1$ and $k = n - 1$. Let $A$ be
the interval $[0, a]$, and let $B$ consist of the points $(x_1, \ldots, x_n)$ for which $0 \le x_i \le t$, $x_1$
is the minimum of $x_1, \ldots, x_n$, and $x_2, \ldots, x_n$ divide $[x_1, t]$ into subintervals of length
at most $a$. Then $P[X_1 = \min X_i, M \le a] = P[X_1 \in A, (X_1, \ldots, X_n) \in B]$. Take $X_1$ for
$X$ and $(X_2, \ldots, X_n)$ for $Y$ in (20.30). Since $X_1$ has density $1/t$,

(20.33)    $P[X_1 = \min X_i, M \le a] = \int_0^a P[(x, X_2, \ldots, X_n) \in B]\, \frac{dx}{t}.$

If $C$ is the event that $x \le X_i \le t$ for $2 \le i \le n$, then $P(C) = (1 - x/t)^{n-1}$. A simple
calculation shows that $P[X_i - x \le s_i, 2 \le i \le n \mid C] = \prod_{i=2}^{n} (s_i/(t - x))$; in other words,
given $C$, the random variables $X_2 - x, \ldots, X_n - x$ are conditionally independent and
uniformly distributed over $[0, t - x]$. Now $X_2, \ldots, X_n$ are random variables on some
probability space $(\Omega, \mathscr{F}, P)$; replacing $P$ by $P(\cdot \mid C)$ shows that the integrand in
(20.33) is the same as that in (20.32). The same argument holds with the index 1
replaced by any $k$ ($1 \le k \le n$), which gives (20.32). (The events $[X_k = \min X_i, M \le a]$
are not disjoint, but any two intersect in a set of probability 0.) ■
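Formula (20.31) is easy to evaluate and to compare with a direct simulation of the longest subinterval. The sketch below is illustrative only: the names `psi_formula` and `psi_monte_carlo` and the trial count are assumptions, and the alternating sum is evaluated in floating point, so cancellation may degrade accuracy for large $n$.

```python
import math
import random

def psi_formula(n, t, a):
    """Right side of (20.31): sum over k of (-1)^k C(n+1, k) (1 - k a/t)_+^n, for n >= 1."""
    total = 0.0
    for k in range(n + 2):
        base = max(1.0 - k * a / t, 0.0)  # positive part
        total += (-1) ** k * math.comb(n + 1, k) * base ** n
    return total

def psi_monte_carlo(n, t, a, trials=50_000, seed=2):
    """Estimate P[M <= a], where M is the longest of the n + 1 subintervals of [0, t]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cuts = sorted(rng.uniform(0.0, t) for _ in range(n))
        points = [0.0] + cuts + [t]
        longest = max(right - left for left, right in zip(points, points[1:]))
        hits += longest <= a
    return hits / trials

if __name__ == "__main__":
    print(psi_formula(3, 1.0, 0.5), psi_monte_carlo(3, 1.0, 0.5))
```

For $n = 1$ and $t = 1$ the formula gives $0$ for $a \le 1/2$ and $2a - 1$ for $1/2 \le a \le 1$, in agreement with the case analysis mentioned in the example.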


Sequences of Random Variables 


Theorem 5.3 extends to general distributions $\mu_n$.


Theorem 20.4. If $\{\mu_n\}$ is a finite or infinite sequence of probability measures
on $\mathscr{R}^1$, there exists on some probability space $(\Omega, \mathscr{F}, P)$ an independent
sequence $\{X_n\}$ of random variables such that $X_n$ has distribution $\mu_n$.


PROOF. By Theorem 5.3 there exists on some probability space an
independent sequence $Z_1, Z_2, \ldots$ of random variables assuming the values 0
and 1 with probabilities $P[Z_n = 0] = P[Z_n = 1] = \tfrac{1}{2}$. As a matter of fact,
Theorem 5.3 is not needed: take the space to be the unit interval and the
$Z_n(\omega)$ to be the digits of the dyadic expansion of $\omega$, that is, the functions $d_n(\omega)$ of
Sections 1 and 4.

Relabel the countably many random variables $Z_n$ so that they form a
double array,

    $\begin{matrix} Z_{11}, & Z_{12}, & \ldots \\ Z_{21}, & Z_{22}, & \ldots \\ \vdots & \vdots & \end{matrix}$

All the $Z_{nk}$ are independent. Put $U_n = \sum_{k=1}^{\infty} Z_{nk} 2^{-k}$. The series certainly
converges, and $U_n$ is a random variable by Theorem 13.4. Further, $U_1, U_2, \ldots$
is, by Theorem 20.2, an independent sequence.

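Now $P[Z_{ni} = z_i, 1 \le i \le k] = 2^{-k}$ for each sequence $z_1, \ldots, z_k$ of 0's and
1's; hence the $2^k$ possible values $j 2^{-k}$, $0 \le j < 2^k$, of $S_{nk} = \sum_{i=1}^{k} Z_{ni} 2^{-i}$ all
have probability $2^{-k}$. If $0 \le x < 1$, the number of the $j 2^{-k}$ that lie in $[0, x]$ is
$\lfloor 2^k x \rfloor + 1$, and therefore $P[S_{nk} \le x] = (\lfloor 2^k x \rfloor + 1)/2^k$. Since $S_{nk}(\omega) \uparrow U_n(\omega)$ as
$k \uparrow \infty$, it follows that $[S_{nk} \le x] \downarrow [U_n \le x]$ as $k \uparrow \infty$, and so $P[U_n \le x] =
\lim_k P[S_{nk} \le x] = \lim_k (\lfloor 2^k x \rfloor + 1)/2^k = x$ for $0 \le x < 1$. Thus $U_n$ is uniformly
distributed over the unit interval.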




The construction thus far establishes the existence of an independent
sequence of random variables $U_n$, each uniformly distributed over $[0, 1]$. Let
$F_n$ be the distribution function corresponding to $\mu_n$, and put $\varphi_n(u) = \inf[x\colon
u \le F_n(x)]$ for $0 < u < 1$. This is the inverse used in Section 14; see (14.5).
Set $\varphi_n(u) = 0$, say, for $u$ outside $(0, 1)$, and put $X_n(\omega) = \varphi_n(U_n(\omega))$. Since
$\varphi_n(u) \le x$ if and only if $u \le F_n(x)$ (see the argument following (14.5)),
$P[X_n \le x] = P[U_n \le F_n(x)] = F_n(x)$. Thus $X_n$ has distribution function $F_n$. And by
Theorem 20.2, $X_1, X_2, \ldots$ are independent. ■
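The two computational steps of the proof can be checked exactly on small cases. The sketch below uses hypothetical helper names (`dyadic_partial_sum`, `prob_partial_sum_at_most`, `quantile`) and, as an assumption made only for illustration, a discrete $F$ with finitely many atoms. It enumerates all $2^k$ bit strings to confirm $P[S_{nk} \le x] = (\lfloor 2^k x \rfloor + 1)/2^k$, and it implements $\varphi(u) = \inf[x\colon u \le F(x)]$ for such an $F$.

```python
from fractions import Fraction
from itertools import product
import math

def dyadic_partial_sum(bits):
    """S_k = sum_{i=1}^k z_i 2^{-i} for a finite 0/1 sequence z_1, ..., z_k."""
    return sum(Fraction(z, 2 ** i) for i, z in enumerate(bits, start=1))

def prob_partial_sum_at_most(k, x):
    """P[S_k <= x] for k independent fair bits: count bit strings and divide by 2^k."""
    count = sum(dyadic_partial_sum(bits) <= x for bits in product((0, 1), repeat=k))
    return Fraction(count, 2 ** k)

def quantile(points, probs, u):
    """phi(u) = inf[x: u <= F(x)] for a discrete F with masses probs at the sorted points."""
    if not 0 < u <= 1:
        raise ValueError("u must lie in (0, 1]")
    cumulative = Fraction(0)
    for x, p in zip(points, probs):
        cumulative += p
        if u <= cumulative:
            return x
    raise ValueError("probs must sum to 1")

if __name__ == "__main__":
    x = Fraction(1, 3)
    print(prob_partial_sum_at_most(6, x), Fraction(math.floor(64 * x) + 1, 64))
    print(quantile([0, 1, 2], [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)], Fraction(3, 10)))
```

Applying `quantile` to a uniform variate gives a variable with distribution function $F$; this is the inverse-transform step $X_n = \varphi_n(U_n)$ of the proof.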


This theorem of course includes Theorem 5.3 as a special case, and its 
proof does not depend on the earlier result. Theorem 20.4 is a special case of 
Kolmogorov’s existence theorem in Section 36. 


Convolution 


Let $X$ and $Y$ be independent random variables with distributions $\mu$ and $\nu$.
Apply (20.27) and (20.29) to the planar set $B = [(x, y)\colon x + y \in H]$ with
$H \in \mathscr{R}^1$:

(20.34)    $P[X + Y \in H] = \int_{-\infty}^{\infty} \nu(H - x)\, \mu(dx) = \int_{-\infty}^{\infty} P[Y \in H - x]\, \mu(dx).$

The convolution of $\mu$ and $\nu$ is the measure $\mu * \nu$ defined by

(20.35)    $(\mu * \nu)(H) = \int_{-\infty}^{\infty} \nu(H - x)\, \mu(dx), \qquad H \in \mathscr{R}^1.$

If $X$ and $Y$ are independent and have distributions $\mu$ and $\nu$, (20.34) shows
that $X + Y$ has distribution $\mu * \nu$. Since addition of random variables is
commutative and associative, the same is true of convolution: $\mu * \nu = \nu * \mu$
and $\mu * (\nu * \eta) = (\mu * \nu) * \eta$.

If $F$ and $G$ are the distribution functions corresponding to $\mu$ and $\nu$, the
distribution function corresponding to $\mu * \nu$ is denoted $F * G$. Taking $H =
(-\infty, y]$ in (20.35) shows that

(20.36)    $(F * G)(y) = \int_{-\infty}^{\infty} G(y - x)\, dF(x).$

(See (17.22) for the notation $dF(x)$.) If $G$ has density $g$, then $G(y - x) =
\int_{-\infty}^{y - x} g(s)\, ds = \int_{-\infty}^{y} g(t - x)\, dt$, and so the right side of (20.36) is
$\int_{-\infty}^{y} \left[\int_{-\infty}^{\infty} g(t - x)\, dF(x)\right] dt$ by Fubini's theorem. Thus $F * G$ has density $F * g$, where

(20.37)    $(F * g)(y) = \int_{-\infty}^{\infty} g(y - x)\, dF(x);$

this holds if $G$ has density $g$. If, in addition, $F$ has density $f$, (20.37) is
denoted $f * g$ and reduces by (16.12) to

(20.38)    $(f * g)(y) = \int_{-\infty}^{\infty} g(y - x) f(x)\, dx.$

This defines convolution for densities, and $\mu * \nu$ has density $f * g$ if $\mu$ and $\nu$
have densities $f$ and $g$. The formula (20.38) can be used for many explicit
calculations.


Example 20.5. Let $X_1, \ldots, X_k$ be independent random variables, each
with the exponential density (20.10). Define $g_k$ by

(20.39)    $g_k(x) = \alpha \frac{(\alpha x)^{k-1}}{(k-1)!} e^{-\alpha x}, \qquad x \ge 0, \quad k = 1, 2, \ldots;$

put $g_k(x) = 0$ for $x < 0$. Now

    $(g_{k-1} * g_1)(y) = \int_0^y g_{k-1}(y - x) g_1(x)\, dx,$

which reduces to $g_k(y)$. Thus $g_k = g_{k-1} * g_1$, and since $g_1$ coincides with
(20.10), it follows by induction that the sum $X_1 + \cdots + X_k$ has density $g_k$.
The corresponding distribution function is

(20.40)    $G_k(x) = 1 - e^{-\alpha x} \sum_{i=0}^{k-1} \frac{(\alpha x)^i}{i!} = \sum_{i=k}^{\infty} e^{-\alpha x} \frac{(\alpha x)^i}{i!}, \qquad x \ge 0,$

as follows by differentiation. ■
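The identity $g_k = g_{k-1} * g_1$ can be checked numerically from (20.38). The following sketch assumes a midpoint rule on $[0, y]$, which suffices because both densities vanish on the negative half-line; the names `g` and `convolve_at` and the step count are illustrative choices.

```python
import math

def g(k, x, alpha):
    """Density (20.39): alpha * (alpha x)^(k-1) / (k-1)! * exp(-alpha x) for x >= 0, else 0."""
    if x < 0:
        return 0.0
    return alpha * (alpha * x) ** (k - 1) / math.factorial(k - 1) * math.exp(-alpha * x)

def convolve_at(f, h, y, steps=2_000):
    """Midpoint-rule approximation of the integral over [0, y] of h(y - x) f(x) dx."""
    if y <= 0:
        return 0.0
    dx = y / steps
    return sum(h(y - (i + 0.5) * dx) * f((i + 0.5) * dx) for i in range(steps)) * dx

if __name__ == "__main__":
    alpha, y = 2.0, 1.3
    for k in (2, 3, 4):
        approx = convolve_at(lambda x: g(k - 1, x, alpha), lambda x: g(1, x, alpha), y)
        print(k, approx, g(k, y, alpha))
```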


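Example 20.6. Suppose that $X$ has the normal density (20.12) with $m = 0$
and that $Y$ has the same density with $\tau$ in place of $\sigma$. If $X$ and $Y$ are
independent, then $X + Y$ has density

    $\frac{1}{2\pi\sigma\tau} \int_{-\infty}^{\infty} \exp\left[-\frac{(y - x)^2}{2\sigma^2} - \frac{x^2}{2\tau^2}\right] dx.$

A change of variable $u = x(\sigma^2 + \tau^2)^{1/2}/\sigma\tau$ reduces this to

    $\frac{1}{2\pi\sqrt{\sigma^2 + \tau^2}} \int_{-\infty}^{\infty} \exp\left[-\frac{1}{2}\left(u - \frac{y\tau}{\sigma\sqrt{\sigma^2 + \tau^2}}\right)^2 - \frac{y^2}{2(\sigma^2 + \tau^2)}\right] du = \frac{1}{\sqrt{2\pi(\sigma^2 + \tau^2)}}\, e^{-y^2/2(\sigma^2 + \tau^2)}.$

Thus $X + Y$ has the normal density with $m = 0$ and with $\sigma^2 + \tau^2$ in place of
$\sigma^2$. ■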


If u and v are arbitrary finite measures on the line, their convolution is 
defined by (20.35) even if they are not probability measures. 


Convergence in Probability 
Random variables $X_n$ converge in probability to $X$, written $X_n \to_P X$,† if

(20.41)    $\lim_n P[|X_n - X| \ge \epsilon] = 0$

for each positive $\epsilon$. This is the same as (5.7), and the proof of Theorem 5.2
carries over without change (see also Example 5.4).


Theorem 20.5. (i) If $X_n \to X$ with probability 1, then $X_n \to_P X$.

(ii) A necessary and sufficient condition for $X_n \to_P X$ is that each subsequence
$\{X_{n_k}\}$ contain a further subsequence $\{X_{n_{k(i)}}\}$ such that $X_{n_{k(i)}} \to X$ with
probability 1 as $i \to \infty$.

PROOF. Only part (ii) needs proof. If $X_n \to_P X$, then given $\{n_k\}$, choose a
subsequence $\{n_{k(i)}\}$ so that $k \ge k(i)$ implies that $P[|X_{n_k} - X| \ge i^{-1}] < 2^{-i}$. By
the first Borel–Cantelli lemma there is probability 1 that $|X_{n_{k(i)}} - X| < i^{-1}$ for
all but finitely many $i$. Therefore, $\lim_i X_{n_{k(i)}} = X$ with probability 1.

If $X_n$ does not converge to $X$ in probability, there is some positive $\epsilon$ for
which $P[|X_{n_k} - X| \ge \epsilon] > \epsilon$ holds along some sequence $\{n_k\}$. No subsequence
of $\{X_{n_k}\}$ can converge to $X$ in probability, and hence none can converge to $X$
with probability 1. ■


It follows from (ii) that if $X_n \to_P X$ and $X_n \to_P Y$, then $X = Y$ with
probability 1. It follows further that if $f$ is continuous and $X_n \to_P X$, then
$f(X_n) \to_P f(X)$.

In nonprobabilistic contexts, convergence in probability becomes convergence
in measure: If $f_n$ and $f$ are real measurable functions on a measure
space $(\Omega, \mathscr{F}, \mu)$, and if $\mu[\omega\colon |f(\omega) - f_n(\omega)| \ge \epsilon] \to 0$ for each $\epsilon > 0$, then $f_n$
converges in measure to $f$.
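Part (ii) of Theorem 20.5 can be illustrated by a standard example that is not taken from this section and is offered only as an assumption-laden sketch: on the unit interval with Lebesgue measure, let $X_n$ be the indicator of the dyadic interval $[j 2^{-m}, (j+1) 2^{-m})$, where $n = 2^m + j$ and $0 \le j < 2^m$. Then $X_n \to_P 0$, every $\omega$ has $X_n(\omega) = 1$ for infinitely many $n$, and the subsequence $X_{2^m}$ converges to 0 at every $\omega > 0$. The function names below are hypothetical.

```python
from fractions import Fraction

def block(n):
    """Return (m, j) with n = 2^m + j and 0 <= j < 2^m, for n >= 1."""
    m = n.bit_length() - 1
    return m, n - 2 ** m

def x_n(n, omega):
    """X_n(omega): indicator of the dyadic interval [j 2^-m, (j+1) 2^-m)."""
    m, j = block(n)
    return 1 if Fraction(j, 2 ** m) <= omega < Fraction(j + 1, 2 ** m) else 0

def prob_nonzero(n):
    """P[|X_n| >= eps] for 0 < eps <= 1: the length 2^-m of the interval."""
    m, _ = block(n)
    return Fraction(1, 2 ** m)

if __name__ == "__main__":
    omega = Fraction(1, 3)
    # One hit in each block of indices 2^m <= n < 2^(m+1), so no convergence at omega.
    print(sum(x_n(n, omega) for n in range(1, 2 ** 10)))
    # Along n = 2^m the values are eventually 0.
    print([x_n(2 ** m, omega) for m in range(8)])
    print(prob_nonzero(2 ** 10))
```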


The Glivenko-Cantelli Theorem*


The empirical distribution function for random variables $X_1, \ldots, X_n$ is the
distribution function $F_n(x, \omega)$ with a jump of $n^{-1}$ at each $X_k(\omega)$:

(20.42)    $F_n(x, \omega) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty, x]}(X_k(\omega)).$

†This is often expressed $\operatorname{plim}_n X_n = X$.
*This topic may be omitted.




If the $X_k$ have a common unknown distribution function $F(x)$, then $F_n(x, \omega)$
is its natural estimate. The estimate has the right limiting behavior, according
to the Glivenko–Cantelli theorem:

Theorem 20.6. Suppose that $X_1, X_2, \ldots$ are independent and have a common
distribution function $F$; put $D_n(\omega) = \sup_x |F_n(x, \omega) - F(x)|$. Then $D_n \to 0$
with probability 1.


For each $x$, $F_n(x, \omega)$ as a function of $\omega$ is a random variable. By right
continuity, the supremum above is unchanged if $x$ is restricted to the
rationals, and therefore $D_n$ is a random variable.

The summands in (20.42) are independent, identically distributed simple
random variables, and so by the strong law of large numbers (Theorem 6.1),
for each $x$ there is a set $A_x$ of probability 0 such that

(20.43)    $\lim_n F_n(x, \omega) = F(x)$

for $\omega \notin A_x$. But Theorem 20.6 says more, namely that (20.43) holds for $\omega$
outside some set $A$ of probability 0, where $A$ does not depend on $x$. Since
there are uncountably many of the sets $A_x$, it is conceivable a priori that
their union might necessarily have positive measure. Further, the convergence
in (20.43) is uniform in $x$. Of course, the theorem implies that with
probability 1 there is weak convergence $F_n(x, \omega) \Rightarrow F(x)$ in the sense of
Section 14.


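PROOF OF THE THEOREM. As already observed, the set $A_x$ where (20.43)
fails has probability 0. Another application of the strong law of large
numbers, with $I_{(-\infty, x)}$ in place of $I_{(-\infty, x]}$ in (20.42), shows that (see (20.5))
$\lim_n F_n(x-, \omega) = F(x-)$ except on a set $B_x$ of probability 0. Let $\varphi(u) =
\inf[x\colon u \le F(x)]$ for $0 < u < 1$ (see (14.5)), and put $x_{m,k} = \varphi(k/m)$, $m \ge 1$,
$1 \le k \le m$. It is not hard to see that $F(\varphi(u)-) \le u \le F(\varphi(u))$; hence
$F(x_{m,k}-) - F(x_{m,k-1}) \le m^{-1}$, $F(x_{m,1}-) \le m^{-1}$, and $F(x_{m,m}) \ge 1 - m^{-1}$.
Let $D_{m,n}(\omega)$ be the maximum of the quantities $|F_n(x_{m,k}, \omega) - F(x_{m,k})|$ and
$|F_n(x_{m,k}-, \omega) - F(x_{m,k}-)|$ for $k = 1, \ldots, m$.

If $x_{m,k-1} \le x < x_{m,k}$, then $F_n(x, \omega) \le F_n(x_{m,k}-, \omega) \le F(x_{m,k}-) +
D_{m,n}(\omega) \le F(x) + m^{-1} + D_{m,n}(\omega)$ and $F_n(x, \omega) \ge F_n(x_{m,k-1}, \omega) \ge F(x_{m,k-1})
- D_{m,n}(\omega) \ge F(x) - m^{-1} - D_{m,n}(\omega)$. Together with similar arguments for the
cases $x < x_{m,1}$ and $x \ge x_{m,m}$, this shows that

(20.44)    $D_n(\omega) \le D_{m,n}(\omega) + m^{-1}.$

If $\omega$ lies outside the union $A$ of all the $A_{x_{m,k}}$ and $B_{x_{m,k}}$, then $\lim_n D_{m,n}(\omega)
= 0$ and hence $\lim_n D_n(\omega) = 0$ by (20.44). But $A$ has probability 0. ■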
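The quantity $D_n$ is straightforward to compute for a sample with a known continuous $F$. The sketch below assumes, for illustration only, the uniform distribution on $[0, 1]$, so that $F(x) = x$; the function name `sup_distance_uniform` is hypothetical. Because $F_n$ is a step function, the supremum is attained at a jump or just before one.

```python
import random

def sup_distance_uniform(sample):
    """D_n = sup_x |F_n(x) - x| for a sample from the uniform distribution on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # F_n rises from i/n (left limit) to (i + 1)/n at the (i + 1)st order statistic.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d

if __name__ == "__main__":
    rng = random.Random(3)
    for n in (100, 1_000, 10_000):
        print(n, sup_distance_uniform([rng.random() for _ in range(n)]))
```

The printed values should shrink as $n$ grows, in line with Theorem 20.6, although a single simulated path is an illustration and not a proof.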




PROBLEMS 


20.1. 2.11↑ A necessary and sufficient condition for a $\sigma$-field $\mathscr{G}$ to be countably
generated is that $\mathscr{G} = \sigma(X)$ for some random variable $X$. Hint: If $\mathscr{G} =
\sigma(A_1, A_2, \ldots)$, consider $X = \sum_{k=1}^{\infty} f(I_{A_k})/10^k$, where $f(x)$ is 4 for $x = 0$ and 5
for $x \ne 0$.


20.2. If $X$ is a positive random variable with density $f$, then $X^{-1}$ has density
$f(1/x)/x^2$. Prove this by (20.16) and by a direct argument.


20.3. Suppose that a two-dimensional distribution function $F$ has a continuous
density $f$. Show that $f(x, y) = \partial^2 F(x, y)/\partial x\, \partial y$.


20.4. The construction in Theorem 20.4 requires only Lebesgue measure on the unit
interval. Use the theorem to prove the existence of Lebesgue measure on $R^k$.
First construct $\lambda_k$ restricted to $(-n, n] \times \cdots \times (-n, n]$, and then pass to the
limit ($n \to \infty$). The idea is to argue from first principles, and not to use previous
constructions, such as those in Theorems 12.5 and 18.2.


20.5. Suppose that $A$, $B$, and $C$ are positive, independent random variables with
distribution function $F$. Show that the quadratic $Az^2 + Bz + C$ has real zeros
with probability $\int_0^\infty \int_0^\infty F(x^2/4y)\, dF(x)\, dF(y)$.


20.6. Show that $X_1, X_2, \ldots$ are independent if $\sigma(X_1, \ldots, X_{n-1})$ and $\sigma(X_n)$ are
independent for each $n$.


20.7. Let $X_0, X_1, \ldots$ be a persistent, irreducible Markov chain, and for a fixed state
$j$ let $T_1, T_2, \ldots$ be the times of the successive passages through $j$. Let $Z_1 = T_1$
and $Z_n = T_n - T_{n-1}$, $n \ge 2$. Show that $Z_1, Z_2, \ldots$ are independent and that
$P[Z_n = k] = f_{jj}^{(k)}$ for $n \ge 2$.


20.8. Ranks and records. Let $X_1, X_2, \ldots$ be independent random variables with a
common continuous distribution function. Let $B$ be the $\omega$-set where
$X_m(\omega) = X_n(\omega)$ for some pair $m, n$ of distinct integers, and show that $P(B) = 0$.
Remove $B$ from the space $\Omega$ on which the $X_n$ are defined. This leaves the
joint distributions of the $X_n$ unchanged and makes ties impossible.

Let $T^{(n)}(\omega) = (T_1^{(n)}(\omega), \ldots, T_n^{(n)}(\omega))$ be that permutation $(t_1, \ldots, t_n)$ of
$(1, \ldots, n)$ for which $X_{t_1}(\omega) < X_{t_2}(\omega) < \cdots < X_{t_n}(\omega)$. Let $Y_n$ be the rank of $X_n$
among $X_1, \ldots, X_n$: $Y_n = r$ if and only if $X_i < X_n$ for exactly $r - 1$ values of $i$
preceding $n$.

(a) Show that $T^{(n)}$ is uniformly distributed over the $n!$ permutations.
(b) Show that $P[Y_n = r] = 1/n$, $1 \le r \le n$.
(c) Show that $Y_k$ is measurable $\sigma(T^{(n)})$ for $k \le n$.
(d) Show that $Y_1, Y_2, \ldots$ are independent.


20.9. ↑ Record values. Let $A_n$ be the event that a record occurs at time $n$:
$\max_{k<n} X_k < X_n$.
(a) Show that $A_1, A_2, \ldots$ are independent and $P(A_n) = 1/n$.
(b) Show that no record stands forever.
(c) Let $N_n$ be the time of the first record after time $n$. Show that
$P[N_n = n + k] = n(n + k - 1)^{-1}(n + k)^{-1}$.




20.10. Use Fubini's theorem to prove that convolution of finite measures is commutative
and associative.


20.11. Suppose that $X$ and $Y$ are independent and have densities. Use (20.20) to find
the joint density for $(X + Y, X)$ and then use (20.19) to find the density for
$X + Y$. Check with (20.38).


20.12. If $F(x - \epsilon) < F(x + \epsilon)$ for all positive $\epsilon$, then $x$ is a point of increase of $F$ (see
Problem 12.9). If $F(x-) < F(x)$, then $x$ is an atom of $F$.

(a) Show that, if $x$ and $y$ are points of increase of $F$ and $G$, then $x + y$ is a
point of increase of $F * G$.

(b) Show that, if $x$ and $y$ are atoms of $F$ and $G$, then $x + y$ is an atom of
$F * G$.


20.13. Suppose that $\mu$ and $\nu$ consist of masses $\alpha_n$ and $\beta_n$ at $n$, $n = 0, 1, 2, \ldots$. Show
that $\mu * \nu$ consists of a mass of $\sum_{k=0}^{n} \alpha_k \beta_{n-k}$ at $n$, $n = 0, 1, 2, \ldots$. Show that
two Poisson distributions (the parameters may differ) convolve to a Poisson
distribution.


20.14. The Cauchy distribution has density

(20.45)    $c_u(x) = \frac{1}{\pi} \frac{u}{u^2 + x^2}, \qquad -\infty < x < \infty,$

for $u > 0$. (By (17.9), the density integrates to 1.)

(a) Show that $c_u * c_v = c_{u+v}$. Hint: Expand the convolution integrand in
partial fractions.
(b) Show that, if $X_1, \ldots, X_n$ are independent and have density $c_u$, then
$(X_1 + \cdots + X_n)/n$ has density $c_u$ as well.


20.15. ↑ (a) Show that, if $X$ and $Y$ are independent and have the standard normal
density, then $X/Y$ has the Cauchy density with $u = 1$.

(b) Show that, if $X$ has the uniform distribution over $(-\pi/2, \pi/2)$, then
$\tan X$ has the Cauchy distribution with $u = 1$.


20.16. 18.18↑ Let $X_1, \ldots, X_n$ be independent, each having the standard normal
distribution. Show that

    $\chi_n^2 = X_1^2 + \cdots + X_n^2$

has density

(20.46)    $\frac{1}{2^{n/2}\, \Gamma(n/2)}\, x^{(n/2)-1} e^{-x/2}$

over $(0, \infty)$. This is called the chi-squared distribution with $n$ degrees of freedom.


20.17. ↑ The gamma distribution has density

(20.47)    $f(x; \alpha, u) = \frac{\alpha^u}{\Gamma(u)}\, x^{u-1} e^{-\alpha x}$

over $(0, \infty)$ for positive parameters $\alpha$ and $u$. Check that (20.47) integrates to 1.
Show that

(20.48)    $f(\cdot\,; \alpha, u) * f(\cdot\,; \alpha, v) = f(\cdot\,; \alpha, u + v).$

Note that (20.46) is $f(x; \tfrac{1}{2}, n/2)$, and from (20.48) deduce again that (20.46) is
the density of $\chi_n^2$. Note that the exponential density (20.10) is $f(x; \alpha, 1)$, and
from (20.48) deduce (20.39) once again.


20.18. ↑ Let $N, X_1, X_2, \ldots$ be independent, where $P[N = n] = q^{n-1} p$, $n \ge 1$, and
each $X_k$ has the exponential density $f(x; \alpha, 1)$. Show that $X_1 + \cdots + X_N$ has
density $f(x; \alpha p, 1)$.


20.19. Let $A_{nm}(\epsilon) = [|Z_k - Z| < \epsilon, n \le k \le m]$. Show that $Z_n \to Z$ with probability 1
if and only if $\lim_n \lim_m P(A_{nm}(\epsilon)) = 1$ for all positive $\epsilon$, whereas $Z_n \to_P Z$ if
and only if $\lim_n P(A_{nn}(\epsilon)) = 1$ for all positive $\epsilon$.


20.20. (a) Suppose that $f\colon R^2 \to R^1$ is continuous. Show that $X_n \to_P X$ and $Y_n \to_P Y$
imply $f(X_n, Y_n) \to_P f(X, Y)$.

(b) Show that addition and multiplication preserve convergence in probability.


20.21. Suppose that the sequence $\{X_n\}$ is fundamental in probability in the sense that
for $\epsilon$ positive there exists an $N_\epsilon$ such that $P[|X_m - X_n| > \epsilon] < \epsilon$ for $m, n \ge N_\epsilon$.
(a) Prove there is a subsequence $\{X_{n_k}\}$ and a random variable $X$ such that
$\lim_k X_{n_k} = X$ with probability 1. Hint: Choose increasing $n_k$ such that
$P[|X_m - X_n| > 2^{-k}] < 2^{-k}$ for $m, n \ge n_k$. Analyze $P[|X_{n_{k+1}} - X_{n_k}| > 2^{-k}]$.

(b) Show that $X_n \to_P X$.


20.22. (a) Suppose that $X_1 \le X_2 \le \cdots$ and that $X_n \to_P X$. Show that $X_n \to X$ with
probability 1.

(b) Show by example that in an infinite measure space functions can converge
almost everywhere without converging in measure.


20.23. If $X_n \to 0$ with probability 1, then $n^{-1} \sum_{k=1}^{n} X_k \to 0$ with probability 1 by the
standard theorem on Cesàro means [A30]. Show by example that this is not so
if convergence with probability 1 is replaced by convergence in probability.


20.24. 2.19↑ (a) Show that in a discrete probability space convergence in probability
is equivalent to convergence with probability 1.

(b) Show that discrete spaces are essentially the only ones where this equivalence
holds: Suppose that $P$ has a nonatomic part in the sense that there is a
set $A$ such that $P(A) > 0$ and $P(\cdot \mid A)$ is nonatomic. Construct random
variables $X_n$ such that $X_n \to_P 0$ but $X_n$ does not converge to 0 with probability 1.




20.25. 20.21 20.24↑ Let $d(X, Y)$ be the infimum of those positive $\epsilon$ for which
$P[|X - Y| \ge \epsilon] \le \epsilon$.

(a) Show that $d(X, Y) = 0$ if and only if $X = Y$ with probability 1. Identify
random variables that are equal with probability 1, and show that $d$ is a metric
on the resulting space.

(b) Show that $X_n \to_P X$ if and only if $d(X_n, X) \to 0$.
(c) Show that the space is complete.

(d) Show that in general there is no metric $d_0$ on this space such that $X_n \to X$
with probability 1 if and only if $d_0(X_n, X) \to 0$.


20.26. Construct in $R^k$ a random variable $X$ that is uniformly distributed over the
surface of the unit sphere in the sense that $|X| = 1$ and $UX$ has the same
distribution as $X$ for orthogonal transformations $U$. Hint: Let $Z$ be uniformly
distributed in the unit ball in $R^k$, define $\psi(x) = x/|x|$ ($\psi(0) = (1, 0, \ldots, 0)$, say),
and take $X = \psi(Z)$.


20.27. ↑ Let $\Theta$ and $\Phi$ be the longitude and latitude of a random point on the
surface of the unit sphere in $R^3$. Show that $\Theta$ and $\Phi$ are independent, $\Theta$ is
uniformly distributed over $[0, 2\pi)$, and $\Phi$ is distributed over $[-\pi/2, +\pi/2]$
with density $\tfrac{1}{2} \cos \phi$.


SECTION 21. EXPECTED VALUES 


Expected Value as Integral 


The expected value of a random variable $X$ on $(\Omega, \mathscr{F}, P)$ is the integral of $X$
with respect to the measure $P$:

    $E[X] = \int X\, dP = \int_\Omega X(\omega)\, P(d\omega).$

All the definitions, conventions, and theorems of Chapter 3 apply. For
nonnegative $X$, $E[X]$ is always defined (it may be infinite); for the general
$X$, $E[X]$ is defined, or $X$ has an expected value, if at least one of $E[X^+]$ and
$E[X^-]$ is finite, in which case $E[X] = E[X^+] - E[X^-]$; and $X$ is integrable if
and only if $E[|X|] < \infty$. The integral $\int_A X\, dP$ over a set $A$ is defined, as
before, as $E[I_A X]$. In the case of simple random variables, the definition
reduces to that used in Sections 5 through 9.


Expected Values and Limits 


The theorems on integration to the limit in Section 16 apply. A useful fact: If
random variables $X_n$ are dominated by an integrable random variable, or if
they are uniformly integrable, then $E[X_n] \to E[X]$ follows if $X_n$ converges to
$X$ in probability; convergence with probability 1 is not necessary. This
follows easily from Theorem 20.5.




Expected Values and Distributions 


Suppose that $X$ has distribution $\mu$. If $g$ is a real function of a real variable,
then by the change-of-variable formula (16.17),

(21.1)    $E[g(X)] = \int_{-\infty}^{\infty} g(x)\, \mu(dx).$

(In applying (16.17), replace $T\colon \Omega \to \Omega'$ by $X\colon \Omega \to R^1$, $\mu$ by $P$, $\mu T^{-1}$ by $\mu$,
and $f$ by $g$.) This formula holds in the sense explained in Theorem 16.13: It
holds in the nonnegative case, so that

(21.2)    $E[|g(X)|] = \int_{-\infty}^{\infty} |g(x)|\, \mu(dx);$

if one side is infinite, then so is the other. And if the two sides of (21.2) are
finite, then (21.1) holds.
If $\mu$ is discrete and $\mu\{x_1, x_2, \ldots\} = 1$, then (21.1) becomes (use Theorem
16.9)

(21.3)    $E[g(X)] = \sum_r g(x_r)\, \mu\{x_r\}.$

If $X$ has density $f$, then (21.1) becomes (use Theorem 16.11)

(21.4)    $E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\, dx.$

If $F$ is the distribution function of $X$ and $\mu$, (21.1) can be written $E[g(X)]
= \int_{-\infty}^{\infty} g(x)\, dF(x)$ in the notation (17.22).


Moments 


By (21.2), $\mu$ and $F$ determine all the absolute moments of $X$:

(21.5)    $E[|X|^k] = \int_{-\infty}^{\infty} |x|^k\, \mu(dx) = \int_{-\infty}^{\infty} |x|^k\, dF(x), \qquad k = 1, 2, \ldots.$

Since $j \le k$ implies that $|x|^j \le 1 + |x|^k$, if $X$ has a finite absolute moment of
order $k$, then it has finite absolute moments of orders $1, 2, \ldots, k - 1$ as well.
For each $k$ for which (21.5) is finite, $X$ has $k$th moment

(21.6)    $E[X^k] = \int_{-\infty}^{\infty} x^k\, \mu(dx) = \int_{-\infty}^{\infty} x^k\, dF(x).$


These quantities are also referred to as the moments of u and of F. They 
can be computed by (21.3) and (21.4) in the appropriate circumstances. 




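Example 21.1. Consider the normal density (20.12) with $m = 0$ and $\sigma = 1$.
For each $k$, $x^k e^{-x^2/2}$ goes to 0 exponentially as $x \to \pm\infty$, and so finite
moments of all orders exist. Integration by parts shows that

    $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^k e^{-x^2/2}\, dx = \frac{k - 1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} x^{k-2} e^{-x^2/2}\, dx, \qquad k = 2, 3, \ldots.$

(Apply (18.16) to $g(x) = x^{k-1}$ and $f(x) = x e^{-x^2/2}$, and let $a \to -\infty$, $b \to \infty$.)
Of course, $E[X] = 0$ by symmetry and $E[X^0] = 1$. It follows by induction
that

(21.7)    $E[X^{2k}] = 1 \times 3 \times 5 \times \cdots \times (2k - 1), \qquad k = 1, 2, \ldots,$

and that the odd moments all vanish. ■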
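The recursion $E[X^k] = (k - 1) E[X^{k-2}]$ behind (21.7) can be run directly and compared with a numerical integral. In this sketch the function names, the truncation of the integral to $[-12, 12]$, and the midpoint rule are all illustrative assumptions.

```python
import math

def normal_moment(k):
    """E[X^k] for the standard normal via E[X^k] = (k - 1) E[X^(k-2)], E[X^0] = 1, E[X] = 0."""
    if k == 0:
        return 1
    if k == 1:
        return 0
    return (k - 1) * normal_moment(k - 2)

def normal_moment_numeric(k, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of (2 pi)^(-1/2) times the integral of x^k exp(-x^2/2)."""
    dx = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx
        total += x ** k * math.exp(-x * x / 2)
    return total * dx / math.sqrt(2 * math.pi)

if __name__ == "__main__":
    for k in range(0, 9):
        print(k, normal_moment(k), round(normal_moment_numeric(k), 6))
```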


If the first two moments of $X$ are finite and $E[X] = m$, then just as in
Section 5, the variance is

(21.8)    $\mathrm{Var}[X] = E[(X - m)^2] = \int_{-\infty}^{\infty} (x - m)^2\, \mu(dx) = E[X^2] - m^2.$

From Example 21.1 and a change of variable, it follows that a random
variable with the normal density (20.12) has mean $m$ and variance $\sigma^2$.
Consider for nonnegative $X$ the relation

(21.9)    $E[X] = \int_0^\infty P[X > t]\, dt = \int_0^\infty P[X \ge t]\, dt.$


Since $P[X = t]$ can be positive for at most countably many values of $t$, the
two integrands differ only on a set of Lebesgue measure 0 and hence the
integrals are the same. For $X$ simple and nonnegative, (21.9) was proved in
Section 5; see (5.29). For the general nonnegative $X$, let $X_n$ be simple
random variables for which $0 \le X_n \uparrow X$ (see (20.1)). By the monotone convergence
theorem $E[X_n] \uparrow E[X]$; moreover, $P[X_n > t] \uparrow P[X > t]$, and therefore
$\int_0^\infty P[X_n > t]\, dt \uparrow \int_0^\infty P[X > t]\, dt$, again by the monotone convergence theorem.
Since (21.9) holds for each $X_n$, a passage to the limit establishes (21.9) for $X$
itself. Note that both sides of (21.9) may be infinite. If the integral on the
right is finite, then $X$ is integrable.
Replacing $X$ by $X I_{[X > \alpha]}$ leads from (21.9) to

(21.10)    $\int_{[X > \alpha]} X\, dP = \alpha P[X > \alpha] + \int_\alpha^\infty P[X > t]\, dt, \qquad \alpha \ge 0.$

As long as $\alpha \ge 0$, this holds even if $X$ is not nonnegative.
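For a simple random variable the tail $P[X > t]$ is a step function of $t$, so (21.9) and (21.10) can be verified in exact arithmetic. This is a minimal sketch; the dictionary representation `{value: probability}` and the function names are assumptions made for illustration.

```python
from fractions import Fraction

def expectation(dist):
    """E[X] for a simple random variable given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def tail_prob(dist, a):
    """P[X > a]."""
    return sum(p for x, p in dist.items() if x > a)

def truncated_mean(dist, a):
    """Integral of X over the event [X > a]."""
    return sum(x * p for x, p in dist.items() if x > a)

def tail_integral(dist, lower=Fraction(0)):
    """Integral over (lower, infinity) of P[X > t] dt, computed piece by piece."""
    total = Fraction(0)
    prev = lower
    for x in sorted(v for v in dist if v > lower):
        # For prev <= t < x the tail P[X > t] equals P[X >= x].
        tail = sum(p for v, p in dist.items() if v >= x)
        total += (x - prev) * tail
        prev = x
    return total

if __name__ == "__main__":
    dist = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
    print(expectation(dist), tail_integral(dist))                      # (21.9)
    a = 1
    print(truncated_mean(dist, a), a * tail_prob(dist, a) + tail_integral(dist, a))  # (21.10)
```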




Inequalities 


Since the final term in (21.10) is nonnegative, $\alpha P[X \ge \alpha] \le \int_{[X \ge \alpha]} X\, dP \le
E[X]$. Thus

(21.11)    $P[X \ge \alpha] \le \frac{1}{\alpha} \int_{[X \ge \alpha]} X\, dP \le \frac{1}{\alpha} E[X], \qquad \alpha > 0,$

for nonnegative $X$. As in Section 5, there follow the inequalities

(21.12)    $P[|X| \ge \alpha] \le \frac{1}{\alpha^k} \int_{[|X| \ge \alpha]} |X|^k\, dP \le \frac{1}{\alpha^k} E[|X|^k].$

It is the inequality between the two extreme terms here that usually goes
under the name of Markov; but the left-hand inequality is often useful, too.
As a special case there is Chebyshev's inequality,

(21.13)    $P[|X - m| \ge \alpha] \le \frac{1}{\alpha^2} \mathrm{Var}[X]$

($m = E[X]$).
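The three terms of (21.12) and the two sides of (21.13) can be computed exactly for a simple random variable. The sketch below is illustrative: the distribution and the function names are assumptions, and a numerical check on one example shows only that the inequalities are consistent, not that they hold in general.

```python
from fractions import Fraction

def mean(dist):
    """E[X] for a simple random variable given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var[X] = E[(X - m)^2]."""
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

def markov_terms(dist, a, k=1):
    """Left, middle, and right terms of (21.12) for alpha = a > 0."""
    scale = Fraction(a) ** k
    left = sum(p for x, p in dist.items() if abs(x) >= a)
    middle = sum(abs(x) ** k * p for x, p in dist.items() if abs(x) >= a) / scale
    right = sum(abs(x) ** k * p for x, p in dist.items()) / scale
    return left, middle, right

def chebyshev_terms(dist, a):
    """Both sides of (21.13): P[|X - m| >= a] and Var[X] / a^2."""
    m = mean(dist)
    left = sum(p for x, p in dist.items() if abs(x - m) >= a)
    return left, variance(dist) / Fraction(a) ** 2

if __name__ == "__main__":
    dist = {-1: Fraction(1, 4), 0: Fraction(1, 2), 4: Fraction(1, 4)}
    print(markov_terms(dist, 2))
    print(chebyshev_terms(dist, 3))
```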
Jensen's inequality

(21.14)    $\varphi(E[X]) \le E[\varphi(X)]$

holds if $\varphi$ is convex on an interval containing the range of $X$ and if $X$ and
$\varphi(X)$ both have expected values. To prove it, let $l(x) = ax + b$ be a supporting
line through $(E[X], \varphi(E[X]))$, that is, a line lying entirely under the graph of $\varphi$
[A33]. Then $aX(\omega) + b \le \varphi(X(\omega))$, so that $aE[X] + b \le E[\varphi(X)]$. But the
left side of this inequality is $\varphi(E[X])$.

Hölder's inequality is

(21.15)    $E[|XY|] \le E^{1/p}[|X|^p]\, E^{1/q}[|Y|^q], \qquad \frac{1}{p} + \frac{1}{q} = 1.$

For discrete random variables, this was proved in Section 5; see (5.35). For
the general case, choose simple random variables $X_n$ and $Y_n$ satisfying
$0 \le |X_n| \uparrow |X|$ and $0 \le |Y_n| \uparrow |Y|$; see (20.2). Then (5.35) and the monotone
convergence theorem give (21.15). Notice that (21.15) implies that if $|X|^p$
and $|Y|^q$ are integrable, then so is $XY$. Schwarz's inequality is the case
$p = q = 2$:

(21.16)    $E[|XY|] \le E^{1/2}[X^2]\, E^{1/2}[Y^2].$

If $X$ and $Y$ have second moments, then $XY$ must have a first moment.




The same reasoning shows that Lyapounov’s inequality (5.37) carries over 
from the simple to the general case. 


Joint Integrals 


The relation (21.1) extends to random vectors. Suppose that $(X_1, \ldots, X_k)$ has
distribution $\mu$ in $k$-space and $g\colon R^k \to R^1$. By Theorem 16.13,

(21.17)    $E[g(X_1, \ldots, X_k)] = \int_{R^k} g(x)\, \mu(dx),$

with the usual provisos about infinite values. For example, $E[X_i X_j] =
\int_{R^k} x_i x_j\, \mu(dx)$. If $E[X_i] = m_i$, the covariance of $X_i$ and $X_j$ is

    $\mathrm{Cov}[X_i, X_j] = E[(X_i - m_i)(X_j - m_j)] = \int_{R^k} (x_i - m_i)(x_j - m_j)\, \mu(dx).$

Random variables are uncorrelated if they have covariance 0.


Independence and Expected Value 


Suppose that $X$ and $Y$ are independent. If they are also simple, then
$E[XY] = E[X]E[Y]$, as proved in Section 5; see (5.25). Define $X_n$ by (20.2)
and similarly define $Y_n = \psi_n(Y^+) - \psi_n(Y^-)$. Then $X_n$ and $Y_n$ are independent
and simple, so that $E[|X_n Y_n|] = E[|X_n|]E[|Y_n|]$, and $0 \le |X_n| \uparrow |X|$, $0 \le |Y_n| \uparrow |Y|$.
If $X$ and $Y$ are integrable, then $E[|X_n Y_n|] = E[|X_n|]E[|Y_n|] \uparrow E[|X|] \cdot E[|Y|]$,
and it follows by the monotone convergence theorem that $E[|XY|] < \infty$; since
$X_n Y_n \to XY$ and $|X_n Y_n| \le |XY|$, it follows further by the dominated convergence
theorem that $E[XY] = \lim_n E[X_n Y_n] = \lim_n E[X_n]E[Y_n] = E[X]E[Y]$.
Therefore, $XY$ is integrable if $X$ and $Y$ are (which is by no means true for
dependent random variables) and $E[XY] = E[X]E[Y]$.

This argument obviously extends inductively: If $X_1, \ldots, X_k$ are independent
and integrable, then the product $X_1 \cdots X_k$ is also integrable and

(21.18)    $E[X_1 \cdots X_k] = E[X_1] \cdots E[X_k].$

Suppose that $\mathscr{G}_1$ and $\mathscr{G}_2$ are independent $\sigma$-fields, $A$ lies in $\mathscr{G}_1$, $X_1$ is
measurable $\mathscr{G}_1$, and $X_2$ is measurable $\mathscr{G}_2$. Then $I_A X_1$ and $X_2$ are independent,
so that (21.18) gives

(21.19)    $\int_A X_1 X_2\, dP = \int_A X_1\, dP \cdot E[X_2]$

if the random variables are integrable. In particular,

(21.20)    $\int_A X_2\, dP = P(A) E[X_2].$




From (21.18) it follows just as for simple random variables (see (5.28)) that 
variances add for sums of independent random variables. It is even enough 
that the random variables be independent in pairs. 


Moment Generating Functions 


The moment generating function is defined as

(21.21)    $M(s) = E[e^{sX}] = \int_{-\infty}^{\infty} e^{sx}\, \mu(dx) = \int_{-\infty}^{\infty} e^{sx}\, dF(x)$


for all s for which this is finite (note that the integrand is nonnegative). 
Section 9 shows in the case of simple random variables the power of moment 
generating function methods. This function is also called the Laplace trans- 
form of u, especially in nonprobabilistic contexts. 

Now $\int_0^\infty e^{sx}\,\mu(dx)$ is finite for $s\le0$, and if it is finite for a positive $s$, then it
is finite for all smaller $s$. Together with the corresponding result for the left
half-line, this shows that $M(s)$ is defined on some interval containing 0. If $X$
is nonnegative, this interval contains $(-\infty,0]$ and perhaps part of $(0,\infty)$; if $X$
is nonpositive, it contains $[0,\infty)$ and perhaps part of $(-\infty,0)$. It is possible
that the interval consists of 0 alone; this happens, for example, if $\mu$ is
concentrated on the integers and $\mu\{n\}=\mu\{-n\}=C/n^2$ for $n=1,2,\dots$.

Suppose that $M(s)$ is defined throughout an interval $(-s_0,s_0)$, where
$s_0>0$. Since $e^{|sx|}\le e^{sx}+e^{-sx}$ and the latter function is integrable $\mu$ for
$|s|<s_0$, so is $\sum_{k=0}^\infty|sx|^k/k!=e^{|sx|}$. By the corollary to Theorem 16.7, $\mu$ has
finite moments of all orders and

$$M(s)=\sum_{k=0}^{\infty}\frac{s^k}{k!}E[X^k]=\sum_{k=0}^{\infty}\frac{s^k}{k!}\int_{-\infty}^{\infty}x^k\,\mu(dx),\qquad|s|<s_0.\tag{21.22}$$

Thus $M(s)$ has a Taylor expansion about 0 with positive radius of convergence
if it is defined in some $(-s_0,s_0)$, $s_0>0$. If $M(s)$ can somehow be
calculated and expanded in a series $\sum_ka_ks^k$, and if the coefficients $a_k$ can be
identified, then, since $a_k$ must coincide with $E[X^k]/k!$, the moments of $X$
can be computed: $E[X^k]=a_kk!$. It also follows from the theory of Taylor
expansions [A29] that $a_kk!$ is the $k$th derivative $M^{(k)}(s)$ evaluated at $s=0$:

$$M^{(k)}(0)=E[X^k]=\int_{-\infty}^{\infty}x^k\,\mu(dx).\tag{21.23}$$


This holds if M(s) exists in some neighborhood of 0. 

Suppose now that $M$ is defined in some neighborhood of $s$. If $\nu$ has
density $e^{sx}/M(s)$ with respect to $\mu$ (see (16.11)), then $\nu$ has moment
generating function $N(u)=M(s+u)/M(s)$ for $u$ in some neighborhood of 0.
But then by (21.23), $N^{(k)}(0)=\int_{-\infty}^{\infty}x^k\,\nu(dx)=\int_{-\infty}^{\infty}x^ke^{sx}\,\mu(dx)/M(s)$, and since
$N^{(k)}(0)=M^{(k)}(s)/M(s)$,

$$M^{(k)}(s)=\int_{-\infty}^{\infty}x^ke^{sx}\,\mu(dx).\tag{21.24}$$


This holds as long as the moment generating function exists in some neigh- 
borhood of s. If s = 0, this gives (21.23) again. Taking k = 2 shows that M(s) 
is convex in its interval of definition. 


Example 21.2. For the standard normal density,

$$M(s)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{sx}e^{-x^2/2}\,dx=\frac{1}{\sqrt{2\pi}}\,e^{s^2/2}\int_{-\infty}^{\infty}e^{-(x-s)^2/2}\,dx,$$

and a change of variable gives

$$M(s)=e^{s^2/2}.\tag{21.25}$$

The moment generating function is in this case defined for all $s$. Since $e^{s^2/2}$
has the expansion

$$e^{s^2/2}=\sum_{k=0}^{\infty}\frac{1}{k!}\Bigl(\frac{s^2}{2}\Bigr)^k=\sum_{k=0}^{\infty}\frac{1\cdot3\cdot5\cdots(2k-1)}{(2k)!}\,s^{2k},$$

the moments can be read off from (21.22), which proves (21.7) once more. ∎
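As a quick check of how moments are read off from the series, the following sketch multiplies the coefficient of $s^{2k}$ in $e^{s^2/2}$ by $(2k)!$ and compares the result with the product $1\cdot3\cdots(2k-1)$. The function names are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction
from math import factorial

def normal_moment_from_series(k):
    """E[X^(2k)] = a_{2k} * (2k)!, where a_{2k} = 1 / (2^k k!) is the
    coefficient of s^(2k) in the expansion of exp(s^2 / 2)."""
    a_2k = Fraction(1, 2 ** k * factorial(k))
    return a_2k * factorial(2 * k)

def odd_product(k):
    """1 * 3 * 5 * ... * (2k - 1), with the empty product equal to 1."""
    result = 1
    for j in range(1, 2 * k, 2):
        result *= j
    return result

even_moments = [normal_moment_from_series(k) for k in range(6)]
```

The odd moments vanish because the expansion contains only even powers of $s$.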


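Example 21.3. In the exponential case (20.10), the moment generating
function

$$M(s)=\int_0^{\infty}e^{sx}\alpha e^{-\alpha x}\,dx=\frac{\alpha}{\alpha-s}=\sum_{k=0}^{\infty}\frac{s^k}{\alpha^k}\tag{21.26}$$

is defined for $s<\alpha$. By (21.22) the $k$th moment is $k!\,\alpha^{-k}$. The mean and
variance are thus $\alpha^{-1}$ and $\alpha^{-2}$. ∎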


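Example 21.4. For the Poisson distribution (20.7),

$$M(s)=\sum_{r=0}^{\infty}e^{rs}e^{-\lambda}\frac{\lambda^r}{r!}=e^{\lambda(e^s-1)}.\tag{21.27}$$

Since $M'(s)=\lambda e^sM(s)$ and $M''(s)=(\lambda^2e^{2s}+\lambda e^s)M(s)$, the first two
moments are $M'(0)=\lambda$ and $M''(0)=\lambda^2+\lambda$; the mean and variance are both $\lambda$. ∎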
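The Poisson computation can be checked numerically. This is a minimal sketch with hypothetical function names: it compares a truncated version of the series in (21.27) with the closed form, and recovers the mean and variance from the derivative formulas quoted in the example.

```python
import math

def poisson_mgf_series(lam, s, terms=200):
    """Truncated sum of e^{rs} e^{-lam} lam^r / r! over r = 0, ..., terms - 1."""
    total = 0.0
    term = math.exp(-lam)                  # r = 0 term
    for r in range(terms):
        total += term
        term *= lam * math.exp(s) / (r + 1)
    return total

def poisson_mgf_closed(lam, s):
    """Closed form exp(lam * (e^s - 1)) from (21.27)."""
    return math.exp(lam * (math.exp(s) - 1.0))

def mean_and_variance(lam):
    """M'(0) and M''(0) - M'(0)^2, using M'(s) = lam e^s M(s) and
    M''(s) = (lam^2 e^{2s} + lam e^s) M(s) at s = 0."""
    m0 = poisson_mgf_closed(lam, 0.0)      # equals 1
    first = lam * m0
    second = (lam ** 2 + lam) * m0
    return first, second - first ** 2
```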




Let $X_1,\dots,X_k$ be independent random variables, and suppose that each
$X_i$ has a moment generating function $M_i(s)=E[e^{sX_i}]$ in $(-s_0,s_0)$. For
$|s|<s_0$, each $\exp(sX_i)$ is integrable, and, since they are independent, their
product $\exp(s\sum_{i=1}^kX_i)$ is also integrable (see (21.18)). The moment generating
function of $X_1+\cdots+X_k$ is therefore

$$M(s)=M_1(s)\cdots M_k(s)\tag{21.28}$$

in $(-s_0,s_0)$. This relation for simple random variables was essential to the
arguments in Section 9.
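The product rule (21.28) is easy to test on finite distributions. The sketch below, with hypothetical helpers `mgf` and `convolve` and an arbitrary choice of a fair die and a fair coin, builds the distribution of the sum of two independent variables and compares its moment generating function with the product of the individual ones.

```python
import math
from collections import defaultdict

def mgf(pmf, s):
    """M(s) = E[e^{sX}] for a finite pmf given as {value: probability}."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

def convolve(pmf_a, pmf_b):
    """Distribution of X + Y for independent X ~ pmf_a and Y ~ pmf_b."""
    out = defaultdict(float)
    for x, p in pmf_a.items():
        for y, q in pmf_b.items():
            out[x + y] += p * q
    return dict(out)

die = {face: 1 / 6 for face in range(1, 7)}   # assumed example: fair die
coin = {0: 0.5, 1: 0.5}                       # assumed example: fair coin
total = convolve(die, coin)
```

Problem 21.11 shows that the converse fails: the product relation can hold for dependent variables too.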

For simple random variables it was shown in Section 9 that the moment 
generating function determines the distribution. This will later be proved for 
general random variables; see Theorem 22.2 for the nonnegative case and 
Section 30 for the general case. 


PROBLEMS 


21.1. Prove

$$\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}e^{-tx^2/2}\,dx=t^{-1/2},$$

differentiate $k$ times with respect to $t$ inside the integral (justify), and derive
(21.7) again.


21.2. Show that, if $X$ has the standard normal distribution, then $E[|X|^{2n+1}]=2^nn!\sqrt{2/\pi}$.


21.3. 20.9↑ Records. Consider the sequence of records in the sense of Problem
20.9. Show that the expected waiting time to the next record is infinite.


21.4. 20.14↑ Show that the Cauchy distribution has no mean.


21.5. Prove the first Borel—Cantelli lemma by applying Theorem 16.6 to indicator 
random variables. Why is Theorem 16.6 not enough for the second 
Borel—Cantelli lemma? 


21.6. Prove (21.9) by Fubini’s theorem. 


21.7. Prove for integrable $X$ that

$$E[X]=\int_0^{\infty}P[X>t]\,dt-\int_{-\infty}^{0}P[X<t]\,dt.$$




21.8. (a) Suppose that $X$ and $Y$ have first moments, and prove

$$E[Y]-E[X]=\int_{-\infty}^{\infty}\bigl(P[X<t\le Y]-P[Y<t\le X]\bigr)\,dt.$$

(b) Let $(X,Y]$ be a nondegenerate random interval. Show that its expected
length is the integral with respect to $t$ of the probability that it covers $t$.


21.9. Suppose that $X$ and $Y$ are random variables with distribution functions $F$
and $G$.

(a) Show that if $F$ and $G$ have no common jumps, then $E[F(Y)]+E[G(X)]=1$.

(b) If $F$ is continuous, then $E[F(X)]=\tfrac12$.

(c) Even if $F$ and $G$ have common jumps, if $X$ and $Y$ are taken to be
independent, then $E[F(Y)]+E[G(X)]=1+P[X=Y]$.

(d) Even if $F$ has jumps, $E[F(X)]=\tfrac12+\tfrac12\sum_xP^2[X=x]$.


21.10. (a) Show that uncorrelated variables need not be independent.

(b) Show that $\operatorname{Var}[\sum_{i=1}^nX_i]=\sum_{i,j=1}^n\operatorname{Cov}[X_i,X_j]=\sum_{i=1}^n\operatorname{Var}[X_i]+2\sum_{1\le i<j\le n}\operatorname{Cov}[X_i,X_j]$.
The cross terms drop out if the $X_i$ are uncorrelated,
and hence drop out if they are independent.


21.11. ↑ Let $X$, $Y$, and $Z$ be independent random variables such that $X$ and $Y$
assume the values 0, 1, 2 with probability $\tfrac13$ each and $Z$ assumes the values 0
and 1 with probabilities $\tfrac13$ and $\tfrac23$. Let $X'=X$ and $Y'=X+Z$ (mod 3).

(a) Show that $X'$, $Y'$, and $X'+Y'$ have the same one-dimensional distributions
as $X$, $Y$, and $X+Y$, respectively, even though $(X',Y')$ and $(X,Y)$ have
different distributions.

(b) Show that $X'$ and $Y'$ are dependent but uncorrelated.

(c) Show that, despite dependence, the moment generating function of $X'+Y'$
is the product of the moment generating functions of $X'$ and $Y'$.


21.12. Suppose that $X$ and $Y$ are independent, nonnegative random variables and
that $E[X]=\infty$ and $E[Y]=0$. What is the value common to $E[XY]$ and
$E[X]E[Y]$? Use the conventions (15.2) for both the product of the random
variables and the product of their expected values. What if $E[X]=\infty$ and
$0<E[Y]<\infty$?


21.13. Suppose that $X$ and $Y$ are independent and that $f(x,y)$ is nonnegative. Put
$g(x)=E[f(x,Y)]$ and show that $E[g(X)]=E[f(X,Y)]$. Show more generally
that $\int_{X\in A}g(X)\,dP=\int_{X\in A}f(X,Y)\,dP$. Extend to $f$ that may be negative.


21.14. ↑ The integrability of $X+Y$ does not imply that of $X$ and $Y$ separately.
Show that it does if $X$ and $Y$ are independent.


21.15. 20.25↑ Write $d(X,Y)=E[|X-Y|/(1+|X-Y|)]$. Show that this is a metric
equivalent to the one in Problem 20.25.


21.16. For the density $C\exp(-|x|^{1/2})$, $-\infty<x<\infty$, show that moments of all orders
exist but that the moment generating function exists only at $s=0$.




21.17. 16.6↑ Show that a moment generating function $M(s)$ defined in $(-s_0,s_0)$,
$s_0>0$, can be extended to a function analytic in the strip $[z\colon -s_0<\operatorname{Re}z<s_0]$.
If $M(s)$ is defined in $[0,s_0)$, $s_0>0$, show that it can be extended to a function
continuous in $[z\colon 0\le\operatorname{Re}z<s_0]$ and analytic in $[z\colon 0<\operatorname{Re}z<s_0]$.


21.18. Use (21.28) to find the generating function of (20.39). 


21.19. For independent random variables having moment generating functions, show 
by (21.28) that the variances add. 


21.20. 20.17↑ Show that the gamma density (20.47) has moment generating function
$(1-s/\alpha)^{-u}$ for $s<\alpha$. Show that the $k$th moment is $u(u+1)\cdots(u+k-1)/\alpha^k$.
Show that the chi-squared distribution with $n$ degrees of freedom has
mean $n$ and variance $2n$.


21.21. Let $X_1,X_2,\dots$ be identically distributed random variables with finite second
moment. Show that $nP[|X_1|\ge\epsilon\sqrt n]\to0$ and $n^{-1/2}\max_{k\le n}|X_k|\to_P0$.


SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 


Let $X_1,X_2,\dots$ be a sequence of independent random variables on some
probability space. It is natural to ask whether the infinite series $\sum_{n=1}^\infty X_n$
converges with probability 1, or as in Section 6 whether $n^{-1}\sum_{k=1}^nX_k$ converges
to some limit with probability 1. It is to questions of this sort that the
present section is devoted.


Throughout the section, $S_n$ will denote the partial sum $\sum_{k=1}^nX_k$ ($S_0=0$).


The Strong Law of Large Numbers 


The central result is a general version of Theorem 6.1. 


Theorem 22.1. If $X_1,X_2,\dots$ are independent and identically distributed
and have finite mean, then $S_n/n\to E[X_1]$ with probability 1.
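A small simulation gives a feel for the statement of Theorem 22.1. It is a sketch only, under assumptions made here: the summands are uniform on [0, 1) (so the mean is 1/2), a fixed seed is used for repeatability, and `running_mean` is a hypothetical helper. A simulation illustrates the theorem along one sample path but of course proves nothing.

```python
import random

def running_mean(sample, n):
    """Return S_n / n for n draws from the zero-argument callable `sample`."""
    total = 0.0
    for _ in range(n):
        total += sample()
    return total / n

rng = random.Random(0)   # fixed seed so the run is repeatable
# Each entry uses fresh draws from the same generator.
means = {n: running_mean(rng.random, n) for n in (100, 10_000, 200_000)}
```

For these summands the standard deviation of $S_n/n$ is about $0.29/\sqrt n$, so the value at $n=200{,}000$ should sit well within 0.01 of 1/2.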


Formerly this theorem stood at the end of a chain of results. The following 
argument, due to Etemadi, proceeds from first principles. 


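PROOF. If the theorem holds for nonnegative random variables, then
$n^{-1}S_n=n^{-1}\sum_{k=1}^nX_k^+-n^{-1}\sum_{k=1}^nX_k^-\to E[X_1^+]-E[X_1^-]=E[X_1]$ with
probability 1. Assume then that $X_k\ge0$.

Consider the truncated random variables $Y_k=X_kI_{[X_k\le k]}$ and their partial
sums $S_n^*=\sum_{k=1}^nY_k$. For $\alpha>1$, temporarily fixed, let $u_n=\lfloor\alpha^n\rfloor$. The first step
is to prove

$$\sum_{n=1}^{\infty}P\biggl[\,\biggl|\frac{S_{u_n}^*-E[S_{u_n}^*]}{u_n}\biggr|>\epsilon\biggr]<\infty.\tag{22.1}$$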




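Since the $X_n$ are independent and identically distributed,

$$\operatorname{Var}[S_n^*]=\sum_{k=1}^n\operatorname{Var}[Y_k]\le\sum_{k=1}^nE[Y_k^2]=\sum_{k=1}^nE\bigl[X_1^2I_{[X_1\le k]}\bigr]\le nE\bigl[X_1^2I_{[X_1\le n]}\bigr].$$

It follows by Chebyshev's inequality that the sum in (22.1) is at most

$$\sum_{n=1}^{\infty}\frac{\operatorname{Var}[S_{u_n}^*]}{\epsilon^2u_n^2}\le\frac{1}{\epsilon^2}E\biggl[X_1^2\sum_{n=1}^{\infty}u_n^{-1}I_{[X_1\le u_n]}\biggr].$$

Let $K=2\alpha/(\alpha-1)$, and suppose $x>0$. If $N$ is the smallest $n$ such that
$u_n\ge x$, then $\alpha^N\ge x$, and since $y\le2\lfloor y\rfloor$ for $y\ge1$,

$$\sum_{u_n\ge x}u_n^{-1}\le2\sum_{n\ge N}\alpha^{-n}=K\alpha^{-N}\le Kx^{-1}.$$

Therefore, $\sum_{n=1}^\infty u_n^{-1}I_{[X_1\le u_n]}\le KX_1^{-1}$ for $X_1>0$, and the sum in (22.1) is at
most $K\epsilon^{-2}E[X_1]<\infty$.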

From (22.1) it follows by the first Borel-Cantelli lemma (take a union over
positive, rational $\epsilon$) that $(S_{u_n}^*-E[S_{u_n}^*])/u_n\to0$ with probability 1. But by the
consistency of Cesàro summation [A30], $n^{-1}E[S_n^*]=n^{-1}\sum_{k=1}^nE[Y_k]$ has the
same limit as $E[Y_n]$, namely, $E[X_1]$. Therefore $S_{u_n}^*/u_n\to E[X_1]$ with
probability 1. Since

$$\sum_{n=1}^{\infty}P[X_n\ne Y_n]=\sum_{n=1}^{\infty}P[X_1>n]\le\int_0^{\infty}P[X_1>t]\,dt=E[X_1]<\infty,$$

another application of the first Borel-Cantelli lemma shows that $(S_n^*-S_n)/n\to0$
and hence

$$\frac{S_{u_n}}{u_n}\to E[X_1]\tag{22.2}$$

with probability 1.

If $u_n\le k\le u_{n+1}$, then since $X_i\ge0$,

$$\frac{u_n}{u_{n+1}}\,\frac{S_{u_n}}{u_n}\le\frac{S_k}{k}\le\frac{u_{n+1}}{u_n}\,\frac{S_{u_{n+1}}}{u_{n+1}}.$$



But $u_{n+1}/u_n\to\alpha$, and so it follows by (22.2) that

$$\frac{1}{\alpha}E[X_1]\le\liminf_k\frac{S_k}{k}\le\limsup_k\frac{S_k}{k}\le\alpha E[X_1]$$

with probability 1. This is true for each $\alpha>1$. Intersecting the corresponding
sets over rational $\alpha$ exceeding 1 gives $\lim_kS_k/k=E[X_1]$ with probability 1. ∎


Although the hypothesis that the $X_n$ all have the same distribution is used
several times in this proof, independence is used only through the equation
$\operatorname{Var}[S_n^*]=\sum_{k=1}^n\operatorname{Var}[Y_k]$, and for this it is enough that the $X_n$ be independent
in pairs. The proof given for Theorem 6.1 of course extends beyond the
case of simple random variables, but it requires $E[X_1^4]<\infty$.


Corollary. Suppose that $X_1,X_2,\dots$ are independent and identically distributed
and $E[X_1^-]<\infty$, $E[X_1^+]=\infty$ (so that $E[X_1]=\infty$). Then $n^{-1}\sum_{k=1}^nX_k\to\infty$
with probability 1.


PROOF. By the theorem, $n^{-1}\sum_{k=1}^nX_k^-\to E[X_1^-]$ with probability 1, and so
it suffices to prove the corollary for the case $X_k=X_k^+\ge0$. If

$$X_n^{(u)}=\begin{cases}X_n&\text{if }0\le X_n\le u,\\0&\text{if }X_n>u,\end{cases}$$

then $n^{-1}\sum_{k=1}^nX_k\ge n^{-1}\sum_{k=1}^nX_k^{(u)}\to E[X_1^{(u)}]$ by the theorem. Let $u\to\infty$. ∎


The Weak Law and Moment Generating Functions 


The weak law of large numbers (Section 6) carries over without change to the
case of general random variables with second moments; only Chebyshev's
inequality is required. The idea can be used to prove in a very simple way
that a distribution concentrated on $[0,\infty)$ is uniquely determined by its
moment generating function or Laplace transform.

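For each $\lambda$, let $Y_\lambda$ be a random variable (on some probability space)
having the Poisson distribution with parameter $\lambda$. Since $Y_\lambda$ has mean and
variance $\lambda$ (Example 21.4), Chebyshev's inequality gives

$$P\biggl[\,\biggl|\frac{Y_\lambda-\lambda}{\lambda}\biggr|\ge\epsilon\biggr]\le\frac{\lambda}{\lambda^2\epsilon^2}\to0,\qquad\lambda\to\infty.$$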


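Let $G_\lambda$ be the distribution function of $Y_\lambda/\lambda$, so that

$$G_\lambda(t)=\sum_{k=0}^{\lfloor\lambda t\rfloor}e^{-\lambda}\frac{\lambda^k}{k!}.$$

The result above can be restated as

$$\lim_{\lambda\to\infty}G_\lambda(t)=\begin{cases}1&\text{if }t>1,\\0&\text{if }t<1.\end{cases}\tag{22.3}$$

In the notation of Section 14, $G_\lambda(x)\Rightarrow\Delta(x-1)$ as $\lambda\to\infty$.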
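The limit (22.3) can be observed numerically. The sketch below evaluates the Poisson partial sum that defines $G_\lambda(t)$ for a moderately large $\lambda$; the function name `poisson_scaled_cdf` and the value $\lambda=400$ are choices made here for illustration (much larger $\lambda$ would underflow $e^{-\lambda}$ in double precision with this naive recursion).

```python
import math

def poisson_scaled_cdf(lam, t):
    """G_lam(t) = P[Y_lam / lam <= t] = sum over k <= lam * t of
    e^{-lam} lam^k / k!, computed by a term-by-term recursion."""
    if t < 0:
        return 0.0
    total = 0.0
    term = math.exp(-lam)                  # k = 0 term
    for k in range(math.floor(lam * t) + 1):
        total += term
        term *= lam / (k + 1)              # next term: lam^(k+1) / (k+1)!
    return total

above = poisson_scaled_cdf(400, 1.2)   # t > 1: should be close to 1
below = poisson_scaled_cdf(400, 0.8)   # t < 1: should be close to 0
```

Chebyshev's inequality alone guarantees that `above` is at least 15/16 and `below` at most 1/16 for these inputs, since $\lambda/(\lambda^2\epsilon^2)=1/16$ when $\lambda=400$ and $\epsilon=0.2$.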


Now consider a probability distribution $\mu$ concentrated on $[0,\infty)$. Let $F$ be
the corresponding distribution function. Define

$$M(s)=\int_0^{\infty}e^{-sx}\,\mu(dx),\qquad s\ge0;\tag{22.4}$$

here 0 is included in the range of integration. This is the moment generating
function (21.21), but the argument has been reflected through the origin. It is
a one-sided Laplace transform, defined for all nonnegative $s$.

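For positive $s$, (21.24) gives

$$M^{(k)}(s)=(-1)^k\int_0^{\infty}y^ke^{-sy}\,\mu(dy).\tag{22.5}$$

Therefore, for positive $x$ and $s$,

$$\sum_{k=0}^{\lfloor sx\rfloor}\frac{(-1)^k}{k!}s^kM^{(k)}(s)=\int_0^{\infty}\sum_{k=0}^{\lfloor sx\rfloor}e^{-sy}\frac{(sy)^k}{k!}\,\mu(dy)=\int_0^{\infty}G_{sy}\Bigl(\frac{x}{y}\Bigr)\,\mu(dy).\tag{22.6}$$

Fix $x>0$. If† $0\le y<x$, then $G_{sy}(x/y)\to1$ as $s\to\infty$ by (22.3); if $y>x$, the
limit is 0. If $\mu\{x\}=0$, the integrand on the right in (22.6) thus converges as
$s\to\infty$ to $I_{[0,x]}(y)$ except on a set of $\mu$-measure 0. The bounded convergence
theorem then gives

$$\lim_{s\to\infty}\sum_{k=0}^{\lfloor sx\rfloor}\frac{(-1)^k}{k!}s^kM^{(k)}(s)=\mu[0,x]=F(x).\tag{22.7}$$

† If $y=0$, the integrand in (22.5) is 1 for $k=0$ and 0 for $k\ge1$; hence for $y=0$, the integrand in
the middle term of (22.6) is 1.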
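The inversion formula (22.7) can be tried on a case where everything is explicit. As an assumed example (not from the text), take $\mu$ to be the exponential distribution with rate 1, so that $M(s)=1/(1+s)$ and $M^{(k)}(s)=(-1)^kk!/(1+s)^{k+1}$. The $k$th term of the sum then reduces to $s^k/(1+s)^{k+1}$, and the sketch below compares the finite sum with $F(x)=1-e^{-x}$.

```python
import math

def inversion_sum_exponential(s, x):
    """Finite sum on the left of (22.7), before the limit, for the unit-rate
    exponential distribution: sum over k <= floor(s x) of s^k / (1+s)^(k+1)."""
    ratio = s / (1.0 + s)
    term = 1.0 / (1.0 + s)                 # k = 0 term
    total = 0.0
    for _ in range(math.floor(s * x) + 1):
        total += term
        term *= ratio
    return total

def exponential_cdf(x):
    """F(x) = 1 - e^{-x} for x >= 0."""
    return 1.0 - math.exp(-x)

coarse = inversion_sum_exponential(10, 1.5)
fine = inversion_sum_exponential(2000, 1.5)
target = exponential_cdf(1.5)
```

In this case the sum is a geometric series equal to $1-(s/(1+s))^{\lfloor sx\rfloor+1}$, which makes the convergence to $1-e^{-x}$ visible directly.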




Thus $M(s)$ determines the value of $F$ at $x$ if $x>0$ and $\mu\{x\}=0$, which
covers all but countably many values of $x$ in $[0,\infty)$. Since $F$ is right-continuous,
$F$ itself and hence $\mu$ are determined through (22.7) by $M(s)$. In fact $\mu$ is
by (22.7) determined by the values of $M(s)$ for $s$ beyond an arbitrary $s_0$:


Theorem 22.2. Let $\mu$ and $\nu$ be probability measures on $[0,\infty)$. If

$$\int_0^{\infty}e^{-sx}\,\mu(dx)=\int_0^{\infty}e^{-sx}\,\nu(dx),\qquad s\ge s_0,$$

where $s_0\ge0$, then $\mu=\nu$.


Corollary. Let $f_1$ and $f_2$ be real functions on $[0,\infty)$. If

$$\int_0^{\infty}e^{-sx}f_1(x)\,dx=\int_0^{\infty}e^{-sx}f_2(x)\,dx,\qquad s\ge s_0,$$

where $s_0\ge0$, then $f_1=f_2$ outside a set of Lebesgue measure 0.


The $f_i$ need not be nonnegative, and they need not be integrable, but
$e^{-sx}f_i(x)$ must be integrable over $[0,\infty)$ for $s\ge s_0$.


PROOF. For the nonnegative case, apply the theorem to the probability
densities $g_i(x)=e^{-s_0x}f_i(x)/m$, where $m=\int_0^\infty e^{-s_0x}f_i(x)\,dx$, $i=1,2$. For the
general case, prove that $f_1^++f_2^-=f_2^++f_1^-$ almost everywhere. ∎


Example 22.1. If $\mu_1*\mu_2=\mu_3$, then the corresponding transforms (22.4)
satisfy $M_1(s)M_2(s)=M_3(s)$ for $s\ge0$. If $\mu_i$ is the Poisson distribution with
mean $\lambda_i$, then (see (21.27)) $M_i(s)=\exp[\lambda_i(e^{-s}-1)]$. It follows by Theorem
22.2 that if two of the $\mu_i$ are Poisson, so is the third, and $\lambda_1+\lambda_2=\lambda_3$. ∎


Kolmogorov’s Zero—One Law 


Consider the set $A$ of $\omega$ for which $n^{-1}\sum_{k=1}^nX_k(\omega)\to0$ as $n\to\infty$. For each
$m$, the values of $X_1(\omega),\dots,X_{m-1}(\omega)$ are irrelevant to the question of
whether or not $\omega$ lies in $A$, and so $A$ ought to lie in the $\sigma$-field
$\sigma(X_m,X_{m+1},\dots)$. In fact, $\lim_nn^{-1}\sum_{k=1}^{m-1}X_k(\omega)=0$ for fixed $m$, and hence $\omega$
lies in $A$ if and only if $\lim_nn^{-1}\sum_{k=m}^nX_k(\omega)=0$. Therefore,

$$A=\bigcap_{\epsilon}\bigcup_{N\ge m}\bigcap_{n\ge N}\biggl[\omega\colon\biggl|\,n^{-1}\sum_{k=m}^nX_k(\omega)\biggr|<\epsilon\biggr],\tag{22.8}$$

the first intersection extending over positive rational $\epsilon$. The set on the inside
lies in $\sigma(X_m,X_{m+1},\dots)$, and hence so does $A$. Similarly, the $\omega$-set where the
series $\sum_nX_n(\omega)$ converges lies in each $\sigma(X_m,X_{m+1},\dots)$.

The intersection $\mathcal{T}=\bigcap_{n=1}^\infty\sigma(X_n,X_{n+1},\dots)$ is the tail $\sigma$-field associated
with the sequence $X_1,X_2,\dots$; its elements are tail events. In the case
$X_n=I_{A_n}$, this is the $\sigma$-field (4.29) studied in Section 4. The following general
form of Kolmogorov's zero-one law extends Theorem 4.5.


Theorem 22.3. Suppose that $\{X_n\}$ is independent and that $A\in\mathcal{T}=\bigcap_{n=1}^\infty\sigma(X_n,X_{n+1},\dots)$.
Then either $P(A)=0$ or $P(A)=1$.


PROOF. Let $\mathcal{F}_0=\bigcup_{k=1}^\infty\sigma(X_1,\dots,X_k)$. The first thing to establish is that
$\mathcal{F}_0$ is a field generating the $\sigma$-field $\sigma(X_1,X_2,\dots)$. If $B$ and $C$ lie in $\mathcal{F}_0$,
then $B\in\sigma(X_1,\dots,X_j)$ and $C\in\sigma(X_1,\dots,X_k)$ for some $j$ and $k$; if
$m=\max\{j,k\}$, then $B$ and $C$ both lie in $\sigma(X_1,\dots,X_m)$, so that
$B\cup C\in\sigma(X_1,\dots,X_m)\subset\mathcal{F}_0$. Thus $\mathcal{F}_0$ is closed under the formation of finite unions;
since it is similarly closed under complementation, $\mathcal{F}_0$ is a field. For
$H\in\mathcal{R}^1$, $[X_n\in H]\in\mathcal{F}_0\subset\sigma(\mathcal{F}_0)$, and hence $X_n$ is measurable $\sigma(\mathcal{F}_0)$; thus
$\mathcal{F}_0$ generates $\sigma(X_1,X_2,\dots)$ (which in general is much larger than $\mathcal{F}_0$).

Suppose that $A$ lies in $\mathcal{T}$. Then $A$ lies in $\sigma(X_{k+1},X_{k+2},\dots)$ for each $k$.
Therefore, if $B\in\sigma(X_1,\dots,X_k)$, then $A$ and $B$ are independent by Theorem
20.2. Therefore, $A$ is independent of $\mathcal{F}_0$ and hence by Theorem 4.2 is also
independent of $\sigma(X_1,X_2,\dots)$. But then $A$ is independent of itself:
$P(A\cap A)=P(A)P(A)$. Therefore, $P(A)=P^2(A)$, which implies that $P(A)$ is
either 0 or 1. ∎


As noted above, the set where $\sum_nX_n(\omega)$ converges satisfies the hypothesis
of Theorem 22.3, and so does the set where $n^{-1}\sum_{k=1}^nX_k(\omega)\to0$. In many
similar cases it is very easy to prove by this theorem that a set at hand must
have probability either 0 or 1. But to determine which of 0 and 1 is, in fact,
the probability of the set may be extremely difficult.


Maximal Inequalities 


Essential to the study of random series are maximal inequalities—inequali- 
ties concerning the maxima of partial sums. The best known is that of 
Kolmogorov. 


Theorem 22.4. Suppose that $X_1,\dots,X_n$ are independent with mean 0 and
finite variances. For $\alpha>0$,

$$P\Bigl[\max_{1\le k\le n}|S_k|\ge\alpha\Bigr]\le\frac{1}{\alpha^2}\operatorname{Var}[S_n].\tag{22.9}$$




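PROOF. Let $A_k$ be the set where $|S_k|\ge\alpha$ but $|S_j|<\alpha$ for $j<k$. Since the
$A_k$ are disjoint,

$$\begin{aligned}
E[S_n^2]&\ge\sum_{k=1}^n\int_{A_k}S_n^2\,dP\\
&=\sum_{k=1}^n\int_{A_k}\bigl[S_k^2+2S_k(S_n-S_k)+(S_n-S_k)^2\bigr]\,dP\\
&\ge\sum_{k=1}^n\int_{A_k}\bigl[S_k^2+2S_k(S_n-S_k)\bigr]\,dP.
\end{aligned}$$

Since $A_k$ and $S_k$ are measurable $\sigma(X_1,\dots,X_k)$ and $S_n-S_k$ is measurable
$\sigma(X_{k+1},\dots,X_n)$, and since the means are all 0, it follows by (21.19) and
independence that $\int_{A_k}S_k(S_n-S_k)\,dP=0$. Therefore,

$$E[S_n^2]\ge\sum_{k=1}^n\int_{A_k}S_k^2\,dP\ge\sum_{k=1}^n\alpha^2P(A_k)=\alpha^2P\Bigl[\max_{1\le k\le n}|S_k|\ge\alpha\Bigr].\qquad∎$$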
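Kolmogorov's inequality (22.9) can be verified exactly in a small case by enumerating all sample paths. The sketch assumes, purely for illustration, that the $X_k$ are independent fair $\pm1$ steps, so that $\operatorname{Var}[S_n]=n$; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def max_abs_partial_sum_prob(n, alpha):
    """Exact P[max_{k<=n} |S_k| >= alpha] for n independent fair +-1 steps,
    obtained by enumerating all 2^n equally likely paths."""
    hits = 0
    for steps in product((-1, 1), repeat=n):
        s, peak = 0, 0
        for x in steps:
            s += x
            peak = max(peak, abs(s))
        if peak >= alpha:
            hits += 1
    return Fraction(hits, 2 ** n)

n = 10
# Var[S_n] = n because each step has mean 0 and variance 1.
checks = {a: (max_abs_partial_sum_prob(n, a), Fraction(n, a * a))
          for a in range(1, n + 1)}
```

Each entry of `checks` pairs the exact left side of (22.9) with the bound $n/\alpha^2$; the bound is informative only when $\alpha^2>n$.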


By Chebyshev's inequality, $P[|S_n|\ge\alpha]\le\alpha^{-2}\operatorname{Var}[S_n]$. That this can be
strengthened to (22.9) is an instance of a general phenomenon: For sums of
independent variables, if $\max_{k\le n}|S_k|$ is large, then $|S_n|$ is probably large as
well. Theorem 9.6 is an instance of this, and so is the following result, due to
Etemadi.


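Theorem 22.5. Suppose that $X_1,\dots,X_n$ are independent. For $\alpha\ge0$,

$$P\Bigl[\max_{1\le k\le n}|S_k|\ge3\alpha\Bigr]\le3\max_{1\le k\le n}P[|S_k|\ge\alpha].\tag{22.10}$$

PROOF. Let $B_k$ be the set where $|S_k|\ge3\alpha$ but $|S_j|<3\alpha$ for $j<k$. Since
the $B_k$ are disjoint,

$$\begin{aligned}
P\Bigl[\max_{1\le k\le n}|S_k|\ge3\alpha\Bigr]&\le P[|S_n|\ge\alpha]+\sum_{k=1}^{n-1}P\bigl(B_k\cap[|S_n|<\alpha]\bigr)\\
&\le P[|S_n|\ge\alpha]+\sum_{k=1}^{n-1}P\bigl(B_k\cap[|S_n-S_k|>2\alpha]\bigr)\\
&=P[|S_n|\ge\alpha]+\sum_{k=1}^{n-1}P(B_k)P[|S_n-S_k|>2\alpha]\\
&\le P[|S_n|\ge\alpha]+\max_{1\le k\le n}P[|S_n-S_k|\ge2\alpha]\\
&\le P[|S_n|\ge\alpha]+\max_{1\le k\le n}\bigl(P[|S_n|\ge\alpha]+P[|S_k|\ge\alpha]\bigr)\\
&\le3\max_{1\le k\le n}P[|S_k|\ge\alpha].\qquad∎
\end{aligned}$$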




If the $X_k$ have mean 0 and Chebyshev's inequality is applied to the right
side of (22.10), and if $\alpha$ is replaced by $\alpha/3$, the result is Kolmogorov's
inequality (22.9) with an extra factor of 27 on the right side. For this reason,
the two inequalities are equally useful for the applications in this section.
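Etemadi's inequality (22.10) can be checked the same way, by exact enumeration. The sketch again assumes independent fair $\pm1$ steps as an illustrative special case (the inequality itself needs no moment assumptions); `etemadi_sides` is a hypothetical helper returning both sides of (22.10).

```python
from fractions import Fraction
from itertools import product

def etemadi_sides(n, alpha):
    """Exact left and right sides of (22.10) for n independent fair +-1 steps."""
    paths = list(product((-1, 1), repeat=n))
    max_hits = 0                 # paths with max_k |S_k| >= 3 * alpha
    tail_hits = [0] * n          # tail_hits[k-1] counts paths with |S_k| >= alpha
    for steps in paths:
        s, peak = 0, 0
        for k, x in enumerate(steps):
            s += x
            peak = max(peak, abs(s))
            if abs(s) >= alpha:
                tail_hits[k] += 1
        if peak >= 3 * alpha:
            max_hits += 1
    total = len(paths)
    left = Fraction(max_hits, total)
    right = 3 * max(Fraction(h, total) for h in tail_hits)
    return left, right

sides = {a: etemadi_sides(12, a) for a in (1, 2, 3)}
```

Note that the right side can exceed 1, in which case the inequality is trivially true.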


Convergence of Random Series 


For independent $X_n$, the probability that $\sum X_n$ converges is either 0 or 1. It is
natural to try to characterize the two cases in terms of the distributions of
the individual $X_n$.


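Theorem 22.6. Suppose that $\{X_n\}$ is an independent sequence and $E[X_n]=0$.
If $\sum\operatorname{Var}[X_n]<\infty$, then $\sum X_n$ converges with probability 1.


PROOF. By (22.9),

$$P\Bigl[\max_{1\le k\le r}|S_{n+k}-S_n|>\epsilon\Bigr]\le\frac{1}{\epsilon^2}\sum_{k=1}^r\operatorname{Var}[X_{n+k}].$$

Since the sets on the left are nondecreasing in $r$, letting $r\to\infty$ gives

$$P\Bigl[\sup_{k\ge1}|S_{n+k}-S_n|>\epsilon\Bigr]\le\frac{1}{\epsilon^2}\sum_{k=1}^{\infty}\operatorname{Var}[X_{n+k}].$$

Since $\sum\operatorname{Var}[X_n]$ converges,

$$\lim_nP\Bigl[\sup_{k\ge1}|S_{n+k}-S_n|>\epsilon\Bigr]=0\tag{22.11}$$

for each $\epsilon$.

Let $E(n,\epsilon)$ be the set where $\sup_{j,k\ge n}|S_j-S_k|>2\epsilon$, and put $E(\epsilon)=\bigcap_nE(n,\epsilon)$.
Then $E(n,\epsilon)\downarrow E(\epsilon)$, and (22.11) implies $P(E(\epsilon))=0$. Now
$\bigcup_\epsilon E(\epsilon)$, where the union extends over positive rational $\epsilon$, contains the set
where the sequence $\{S_n\}$ is not fundamental (does not have the Cauchy
property), and this set therefore has probability 0. ∎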

Example 22.2. Let $X_n(\omega)=r_n(\omega)a_n$, where the $r_n$ are the Rademacher
functions on the unit interval; see (1.13). Then $X_n$ has variance $a_n^2$, and so
$\sum a_n^2<\infty$ implies that $\sum r_n(\omega)a_n$ converges with probability 1. An interesting
special case is $a_n=n^{-1}$. If the signs in $\sum\pm n^{-1}$ are chosen on the toss of a
coin, then the series converges with probability 1. The alternating harmonic
series $1-2^{-1}+3^{-1}-\cdots$ is thus typical in this respect. ∎
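The random harmonic series can be simulated along a single sample path. This sketch is only suggestive: it uses a seeded pseudorandom generator in place of coin tosses, and the names `random_sign_partial_sums` and `tail_variance` are introduced here. It records partial sums at two checkpoints and computes the variance of the block of terms between them, which is what Theorem 22.6 controls.

```python
import random

def random_sign_partial_sums(n_terms, checkpoints, seed=0):
    """Partial sums of sum_n (+-1)/n with pseudorandom signs, recorded at
    the requested checkpoint indices."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    total = 0.0
    recorded = {}
    for n in range(1, n_terms + 1):
        sign = 1.0 if rng.random() < 0.5 else -1.0
        total += sign / n
        if n in wanted:
            recorded[n] = total
    return recorded

def tail_variance(n_from, n_to):
    """Var[S_{n_to} - S_{n_from}] = sum of 1/n^2 over n_from < n <= n_to."""
    return sum(1.0 / (n * n) for n in range(n_from + 1, n_to + 1))

sums = random_sign_partial_sums(200_000, (100_000, 200_000))
block_var = tail_variance(100_000, 200_000)
```

The block variance is about $5\times10^{-6}$, so the two recorded partial sums should differ by only a few thousandths.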


If $\sum X_n$ converges with probability 1, then $S_n$ converges with probability 1
to some finite random variable $S$. By Theorem 20.5, this implies that
$S_n\to_PS$. The reverse implication of course does not hold in general, but it
does if the summands are independent.


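Theorem 22.7. For an independent sequence $\{X_n\}$, the $S_n$ converge with
probability 1 if and only if they converge in probability.


PROOF. It is enough to show that if $S_n\to_PS$, then $\{S_n\}$ is fundamental
with probability 1. Since

$$P[|S_{n+j}-S_n|\ge\epsilon]\le P\Bigl[|S_{n+j}-S|\ge\frac{\epsilon}{2}\Bigr]+P\Bigl[|S_n-S|\ge\frac{\epsilon}{2}\Bigr],$$

$S_n\to_PS$ implies

$$\lim_n\sup_{j\ge1}P[|S_{n+j}-S_n|\ge\epsilon]=0.\tag{22.12}$$

But by (22.10),

$$P\Bigl[\max_{1\le j\le k}|S_{n+j}-S_n|\ge\epsilon\Bigr]\le3\max_{1\le j\le k}P\Bigl[|S_{n+j}-S_n|\ge\frac{\epsilon}{3}\Bigr],$$

and therefore

$$P\Bigl[\sup_{k\ge1}|S_{n+k}-S_n|>\epsilon\Bigr]\le3\sup_{k\ge1}P\Bigl[|S_{n+k}-S_n|\ge\frac{\epsilon}{3}\Bigr].$$

It now follows by (22.12) that (22.11) holds, and the proof is completed as
before. ∎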


The final result in this direction, the three-series theorem, provides necessary
and sufficient conditions for the convergence of $\sum X_n$ in terms of the
individual distributions of the $X_n$. Let $X_n^{(c)}$ be $X_n$ truncated at $c$:
$X_n^{(c)}=X_nI_{[|X_n|\le c]}$.


Theorem 22.8. Suppose that $\{X_n\}$ is independent, and consider the three
series

$$\sum P[|X_n|>c],\qquad\sum E[X_n^{(c)}],\qquad\sum\operatorname{Var}[X_n^{(c)}].\tag{22.13}$$

In order that $\sum X_n$ converge with probability 1 it is necessary that the three
series converge for all positive $c$ and sufficient that they converge for some
positive $c$.
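To see what the three series look like in a concrete case, the sketch below computes their terms exactly for a made-up sequence: $X_n=\pm n$ with probability $1/(2n^2)$ each and $X_n=0$ otherwise, for $n\ge2$. This example and the helper names are assumptions for illustration. Here $\operatorname{Var}[X_n]=1$ for every $n$, so Theorem 22.6 does not apply directly, yet with $c=1$ all three series in (22.13) converge.

```python
from fractions import Fraction

def three_series_terms(pmf, c):
    """Terms of the three series (22.13) for one X_n with finite pmf
    {value: probability}: (P[|X_n| > c], E[X_n^(c)], Var[X_n^(c)])."""
    tail = sum(p for x, p in pmf.items() if abs(x) > c)
    # X_n^(c) equals X_n when |X_n| <= c and 0 otherwise.
    mean_c = sum(p * x for x, p in pmf.items() if abs(x) <= c)
    second_c = sum(p * x * x for x, p in pmf.items() if abs(x) <= c)
    return tail, mean_c, second_c - mean_c ** 2

def example_pmf(n):
    """Hypothetical X_n: +-n with probability 1/(2 n^2) each, otherwise 0."""
    p = Fraction(1, 2 * n * n)
    return {n: p, -n: p, 0: 1 - 2 * p}

c = 1
terms = [three_series_terms(example_pmf(n), c) for n in range(2, 501)]
tail_sum = sum(t[0] for t in terms)     # partial sum of sum P[|X_n| > c]
mean_sum = sum(t[1] for t in terms)     # partial sum of sum E[X_n^(c)]
var_sum = sum(t[2] for t in terms)      # partial sum of sum Var[X_n^(c)]
untruncated = [sum(p * x * x for x, p in example_pmf(n).items())
               for n in (2, 10, 100)]   # Var[X_n], since E[X_n] = 0
```

The first series has terms $1/n^2$ and the other two vanish identically, so the sufficiency half of the theorem gives convergence of $\sum X_n$ with probability 1.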




PROOF OF SUFFICIENCY. Suppose that the series (22.13) converge, and
put $m_n^{(c)}=E[X_n^{(c)}]$. By Theorem 22.6, $\sum(X_n^{(c)}-m_n^{(c)})$ converges with
probability 1, and since $\sum m_n^{(c)}$ converges, so does $\sum X_n^{(c)}$. Since $P[X_n\ne X_n^{(c)}\text{ i.o.}]=0$
by the first Borel-Cantelli lemma, it follows finally that $\sum X_n$ converges with
probability 1. ∎


Although it is possible to prove necessity in the three-series theorem by 
the methods of the present section, the simplest and clearest argument uses 
the central limit theorem as treated in Section 27. This involves no circularity 
of reasoning, since the three-series theorem is nowhere used in what follows. 


PROOF OF NECESSITY. Suppose that $\sum X_n$ converges with probability 1,
and fix $c>0$. Since $X_n\to0$ with probability 1, it follows that $\sum X_n^{(c)}$
converges with probability 1 and, by the second Borel-Cantelli lemma, that
$\sum P[|X_n|>c]<\infty$.

Let $M_n^{(c)}$ and $s_n^{(c)}$ be the mean and standard deviation of $S_n^{(c)}=\sum_{k=1}^nX_k^{(c)}$.
If $s_n^{(c)}\to\infty$, then since the $X_n^{(c)}-m_n^{(c)}$ are uniformly bounded, it follows by
the central limit theorem (see Example 27.4) that

$$\lim_nP\biggl[x<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\le y\biggr]=\frac{1}{\sqrt{2\pi}}\int_x^ye^{-t^2/2}\,dt.\tag{22.14}$$

And since $\sum X_n^{(c)}$ converges with probability 1, $s_n^{(c)}\to\infty$ also implies $S_n^{(c)}/s_n^{(c)}\to0$
with probability 1, so that (Theorem 20.5)

$$\lim_nP\bigl[|S_n^{(c)}/s_n^{(c)}|\ge\epsilon\bigr]=0.\tag{22.15}$$

But (22.14) and (22.15) stand in contradiction: Since

$$P\biggl[x<\frac{S_n^{(c)}-M_n^{(c)}}{s_n^{(c)}}\le y,\ \biggl|\frac{S_n^{(c)}}{s_n^{(c)}}\biggr|<\epsilon\biggr]$$

is greater than or equal to the probability in (22.14) minus that in (22.15), it is
positive for all sufficiently large $n$ (if $x<y$). But then

$$x-\epsilon<-M_n^{(c)}/s_n^{(c)}<y+\epsilon,$$

and this cannot hold simultaneously for, say, $(x-\epsilon,y+\epsilon)=(-1,0)$ and
$(x-\epsilon,y+\epsilon)=(0,1)$. Thus $s_n^{(c)}$ cannot go to $\infty$, and the third series in (22.13)
converges.

And now it follows by Theorem 22.6 that $\sum(X_n^{(c)}-m_n^{(c)})$ converges with
probability 1, so that the middle series in (22.13) converges as well. ∎


Example 22.3. If $X_n=r_na_n$, where $r_n$ are the Rademacher functions,
then $\sum a_n^2<\infty$ implies that $\sum X_n$ converges with probability 1. If $\sum X_n$
converges, then $a_n$ is bounded, and for large $c$ the convergence of the third
series in (22.13) implies $\sum a_n^2<\infty$: If the signs in $\sum\pm a_n$ are chosen on the toss
of a coin, then the series converges with probability 1 or 0 according as $\sum a_n^2$
converges or diverges. If $\sum a_n^2$ converges but $\sum|a_n|$ diverges, then $\sum\pm a_n$ is
with probability 1 conditionally but not absolutely convergent. ∎


Example 22.4. If $a_n\downarrow0$ but $\sum a_n^2=\infty$, then $\sum\pm a_n$ converges if the signs
are strictly alternating, but diverges with probability 1 if they are chosen on
the toss of a coin. ∎


Theorems 22.6, 22.7, and 22.8 concern conditional convergence, and in the
most interesting cases, $\sum X_n$ converges not because the $X_n$ go to 0 at a high
rate but because they tend to cancel each other out. In Example 22.4, the
terms cancel well enough for convergence if the signs are strictly alternating,
but not if they are chosen on the toss of a coin.


Random Taylor Series* 


Consider a power series $\sum\pm z^n$, where the signs are chosen on the toss of a
coin. The radius of convergence being 1, the series represents an analytic
function in the open unit disk $D_0=[z\colon|z|<1]$ in the complex plane. The
question arises whether this function can be extended analytically beyond
$D_0$. The answer is no: With probability 1 the unit circle is the natural
boundary.


Theorem 22.9. Let $\{X_n\}$ be an independent sequence such that

$$P[X_n=1]=P[X_n=-1]=\tfrac12,\qquad n=0,1,\dots.\tag{22.16}$$

There is probability 0 that

$$F(\omega,z)=\sum_{n=0}^{\infty}X_n(\omega)z^n\tag{22.17}$$

coincides in $D_0$ with a function analytic in an open set properly containing $D_0$.

It will be seen in the course of the proof that the $\omega$-set in question lies in
$\sigma(X_0,X_1,\dots)$ and hence has a probability. It is intuitively clear that if the set
is measurable at all, it must depend only on the $X_n$ for large $n$ and hence
must have probability either 0 or 1.

PROOF. Since

$$|X_n(\omega)|=1,\qquad n=0,1,\dots\tag{22.18}$$

with probability 1, the series in (22.17) has radius of convergence 1 outside a
set of measure 0.

*This topic, which requires complex variable theory, may be omitted.

Consider an open disk $D=[z\colon|z-\zeta|<r]$, where $\zeta\in D_0$ and $r>0$. Now
(22.17) coincides in $D_0$ with a function analytic in $D_0\cup D$ if and only if its
expansion

$$F(\omega,z)=\sum_{m=0}^{\infty}\frac{1}{m!}F^{(m)}(\omega,\zeta)(z-\zeta)^m$$

about $\zeta$ converges at least for $|z-\zeta|<r$. Let $A_D$ be the set of $\omega$ for which
this holds. The coefficient

$$a_m(\omega)=\frac{1}{m!}F^{(m)}(\omega,\zeta)=\sum_{n=m}^{\infty}\binom{n}{m}X_n(\omega)\zeta^{n-m}$$

is a complex-valued random variable measurable $\sigma(X_m,X_{m+1},\dots)$. By the
root test, $\omega\in A_D$ if and only if $\limsup_m|a_m(\omega)|^{1/m}\le r^{-1}$. For each $m_0$, the
condition for $\omega\in A_D$ can thus be expressed in terms of $a_{m_0}(\omega),a_{m_0+1}(\omega),\dots$
alone, and so $A_D\in\sigma(X_{m_0},X_{m_0+1},\dots)$. Thus $A_D$ has a probability, and in
fact $P(A_D)$ is 0 or 1 by the zero-one law.

Of course, $P(A_D)=1$ if $D\subset D_0$. The central step in the proof is to show
that $P(A_D)=0$ if $D$ contains points not in $D_0$. Assume on the contrary that
$P(A_D)=1$ for such a $D$. Consider that part of the circumference of the unit
circle that lies in $D$, and let $k$ be an integer large enough that this arc has
length exceeding $2\pi/k$. Define

$$Y_n(\omega)=\begin{cases}X_n(\omega)&\text{if }n\not\equiv0\pmod k,\\-X_n(\omega)&\text{if }n\equiv0\pmod k.\end{cases}$$

Let $B_D$ be the $\omega$-set where the function

$$G(\omega,z)=\sum_{n=0}^{\infty}Y_n(\omega)z^n\tag{22.19}$$

coincides in $D_0$ with a function analytic in $D_0\cup D$.

The sequence $\{Y_0,Y_1,\dots\}$ has the same structure as the original sequence:
the $Y_n$ are independent and assume the values $\pm1$ with probability $\tfrac12$ each.
Since $B_D$ is defined in terms of the $Y_n$ in the same way as $A_D$ is defined in
terms of the $X_n$, it is intuitively clear that $P(B_D)$ and $P(A_D)$ must be the
same. Assume for the moment the truth of this statement, which is somewhat
more obvious than its proof.

If for a particular $\omega$ each of (22.17) and (22.19) coincides in $D_0$ with a
function analytic in $D_0\cup D$, the same must be true of

$$F(\omega,z)-G(\omega,z)=2\sum_{m=0}^{\infty}X_{mk}(\omega)z^{mk}.\tag{22.20}$$




Let $D_l=[ze^{2\pi il/k}\colon z\in D]$. Since replacing $z$ by $ze^{2\pi i/k}$ leaves the function
(22.20) unchanged, it can be extended analytically to each $D_0\cup D_l$, $l=1,2,\dots$.
Because of the choice of $k$, it can therefore be extended analytically
to $[z\colon|z|<1+\epsilon]$ for some positive $\epsilon$; but this is impossible if (22.18) holds,
since the radius of convergence must then be 1.

Therefore, $A_D\cap B_D$ cannot contain a point $\omega$ satisfying (22.18). Since
(22.18) holds with probability 1, this rules out the possibility $P(A_D)=P(B_D)=1$
and by the zero-one law leaves only the possibility $P(A_D)=P(B_D)=0$.
Let $A$ be the $\omega$-set where (22.17) extends to a function analytic in some open
set larger than $D_0$. Then $\omega\in A$ if and only if (22.17) extends to $D_0\cup D$ for
some $D=[z\colon|z-\zeta|<r]$ for which $D-D_0\ne\emptyset$, $r$ is rational, and $\zeta$ has
rational real and imaginary parts; in other words, $A$ is the countable union of
$A_D$ for such $D$. Therefore, $A$ lies in $\sigma(X_0,X_1,\dots)$ and has probability 0.

It remains only to show that $P(A_D)=P(B_D)$, and this is most easily done
by comparing $\{X_n\}$ and $\{Y_n\}$ with a canonical sequence having the same
structure. Put $Z_n(\omega)=(X_n(\omega)+1)/2$, and let $T\omega$ be $\sum_{n=0}^\infty Z_n(\omega)2^{-n-1}$ on
the $\omega$-set $A^*$ where this sum lies in $(0,1]$; on $\Omega-A^*$ let $T\omega$ be 1, say.
Because of (22.16) $P(A^*)=1$. Let $\mathcal{F}=\sigma(X_0,X_1,\dots)$ and let $\mathcal{B}$ be the
$\sigma$-field of Borel subsets of $(0,1]$; then $T\colon\Omega\to(0,1]$ is measurable $\mathcal{F}/\mathcal{B}$. Let
$r_n(x)$ be the $n$th Rademacher function. If $M=[x\colon r_i(x)=u_i,\ i=1,\dots,n]$,
where $u_i=\pm1$ for each $i$, then $P(T^{-1}M)=P[\omega\colon X_i(\omega)=u_{i+1},\ i=0,1,\dots,n-1]=2^{-n}$,
which is the Lebesgue measure $\lambda(M)$ of $M$. Since these sets form a
$\pi$-system generating $\mathcal{B}$, $P(T^{-1}M)=\lambda(M)$ for all $M$ in $\mathcal{B}$ (Theorem 3.3).

Let $M_D$ be the set of $x$ for which $\sum_{n=0}^\infty r_{n+1}(x)z^n$ extends analytically to
$D_0\cup D$. Then $M_D$ lies in $\mathcal{B}$, this being a special case of the fact that $A_D$ lies
in $\mathcal{F}$. Moreover, if $\omega\in A^*$, then $\omega\in A_D$ if and only if $T\omega\in M_D$:
$A^*\cap A_D=A^*\cap T^{-1}M_D$. Since $P(A^*)=1$, it follows that $P(A_D)=\lambda(M_D)$.

This argument only uses (22.16), and therefore it applies to $\{Y_n\}$ and $B_D$ as
well. Therefore, $P(B_D)=\lambda(M_D)=P(A_D)$. ∎


PROBLEMS 


22.1. Suppose that $X_1,X_2,\dots$ is an independent sequence and $Y$ is measurable
$\sigma(X_n,X_{n+1},\dots)$ for each $n$. Show that there exists a constant $a$ such that
$P[Y=a]=1$.


22.2. Assume $\{X_n\}$ independent, and define $X_n^{(c)}$ as in Theorem 22.8. Prove that for
$\sum|X_n|$ to converge with probability 1 it is necessary that $\sum P[|X_n|>c]$ and
$\sum E[|X_n^{(c)}|]$ converge for all positive $c$ and sufficient that they converge for some
positive $c$. If the three series (22.13) converge but $\sum E[|X_n^{(c)}|]=\infty$, then there is
probability 1 that $\sum X_n$ converges conditionally but not absolutely.


22.3. ft (a) Generalize the Borel—Cantelli lemmas: Suppose X,, are nonnegative 
If LE[X,] < œ, then ZX, converges with probability 1. If the X,, are indepen: 
dent and uniformly bounded, and if LE[X,]=~, then £ X,, diverges witt 
probability 1. 


SECTION 22. SUMS OF INDEPENDENT RANDOM VARIABLES 295 


22.8. 


22.9. 


22.10. 


22.11. 


22.12. 


(b) Construct independent, nonnegative X,, such that EX, converges with 
probability 1 but LE[X,] diverges. For an extreme example, arrange that 
P{ X, > 0 i.o.] = 0 but ELX,] = 0. 


. Show under the hypothesis of Theorem 22.6 that EX, has finite variance and 


extend Theorem 22.4 to infinite sequences. 


5. 20.14 22.17 Suppose that X,, X,,... are independent, each with the Cauchy 


distribution (20.45) for a common value of u. 


(a) Show that n` 'X}_;X, does not converge with probability 1. Contrast with 
Theorem 22.1. 


(b) Show that P[n~' max, <n X, <x] > e~“/™* for x > 0. Relate to Theorem 
14.3. 


. If Xi, X5,... are independent and identically distributed, and if P[ X, > 0] = 1 


and P[X,>0]>0, then £„,X„,=œ with probability 1. Deduce this from 
Theorem 22.1 and its corollary and also directly: find a positive e such that 
X,, > € infinitely often with probability 1. 


-7. Suppose that X,, X5,... are independent and identically distributed and 


E[|X,|] =. Use (21.9) to show that L, P[|X,|=an]=% for each a, and 
conclude that sup, n~ '|X,,| = with probability 1. Now show that sup, n~ ‘|S, 
= œ with probability 1. Compare with the corollary to Theorem 22.1. 


22.8. Wald’s equation. Let $X_1, X_2, \ldots$ be independent and identically distributed
with finite mean, and put $S_n = X_1 + \cdots + X_n$. Suppose that $\tau$ is a stopping
time: $\tau$ has positive integers as values and $[\tau = n] \in \sigma(X_1, \ldots, X_n)$; see Section
7 for examples. Suppose also that $E[\tau] < \infty$.

(a) Prove that

(22.21)  $E[S_\tau] = E[X_1]\,E[\tau]$.

(b) Suppose that $X_n$ is $\pm 1$ with probabilities $p$ and $q$, $p \ne q$, let $\tau$ be the first
$n$ for which $S_n$ is $-a$ or $b$ ($a$ and $b$ positive integers), and calculate $E[\tau]$. This
gives the expected duration of the game in the gambler’s ruin problem for
unequal $p$ and $q$.


22.9. 20.9↑ Let $Z_n$ be 1 or 0 according as at time $n$ there is or is not a record in
the sense of Problem 20.9. Let $R_n = Z_1 + \cdots + Z_n$ be the number of records
up to time $n$. Show that $R_n/\log n \to_P 1$.


22.10. 22.1↑ (a) Show that for an independent sequence $\{X_n\}$ the radius of
convergence of the random Taylor series $\sum_n X_n z^n$ is $r$ with probability 1 for some
nonrandom $r$.

(b) Suppose that the $X_n$ have the same distribution and $P[X_1 \ne 0] > 0$. Show
that $r$ is 1 or 0 according as $\log^+|X_1|$ has finite mean or not.


22.11. Suppose that $X_0, X_1, \ldots$ are independent and each is uniformly distributed
over $[0, 2\pi]$. Show that with probability 1 the series $\sum_n e^{iX_n} z^n$ has the unit
circle as its natural boundary.


22.12. Prove (what is essentially Kolmogorov’s zero–one law) that if $A$ is independent
of a $\pi$-system $\mathcal{P}$ and $A \in \sigma(\mathcal{P})$, then $P(A)$ is either 0 or 1.


22.13. Suppose that $\mathcal{A}$ is a semiring containing $\Omega$.

(a) Show that if $P(A \cap B) \le bP(B)$ for all $B \in \mathcal{A}$, and if $b < 1$ and $A \in \sigma(\mathcal{A})$,
then $P(A) = 0$.

(b) Show that if $P(A \cap B) \le P(A)P(B)$ for all $B \in \mathcal{A}$, and if $A \in \sigma(\mathcal{A})$,
then $P(A)$ is 0 or 1.

(c) Show that if $aP(B) \le P(A \cap B)$ for all $B \in \mathcal{A}$, and if $a > 0$ and $A \in \sigma(\mathcal{A})$,
then $P(A) = 1$.

(d) Show that if $P(A)P(B) \le P(A \cap B)$ for all $B \in \mathcal{A}$, and if $A \in \sigma(\mathcal{A})$,
then $P(A)$ is 0 or 1.

(e) Reconsider Problem 3.20. 


22.14. 22.12↑ Burstin’s theorem. Let $f$ be a Borel function on $[0, 1]$ with arbitrarily
small periods: For each $\epsilon$ there is a $p$ such that $0 < p < \epsilon$ and $f(x) = f(x + p)$
for $0 \le x \le 1 - p$. Show that such an $f$ is constant almost everywhere:

(a) Show that it is enough to prove that $P(f^{-1}B)$ is 0 or 1 for every Borel set
$B$, where $P$ is Lebesgue measure on the unit interval.

(b) Show that $f^{-1}B$ is independent of each interval $[0, x]$, and conclude that
$P(f^{-1}B)$ is 0 or 1.
(c) Show by example that $f$ need not be constant.


22.15. Assume that $X_1, \ldots, X_n$ are independent and $s, t, a$ are nonnegative. Let

$L(s) = \max_{k \le n} P[|S_k| \ge s]$,  $R(s) = \max_{k \le n} P[|S_n - S_k| \ge s]$,

$M(s) = P[\max_{k \le n} |S_k| \ge s]$,  $T(s) = P[|S_n| \ge s]$.

(a) Following the first part of the proof of (22.10), show that

(22.22)  $M(s + t) \le T(t) + M(s + t)R(s)$.

(b) Take $s = 2a$ and $t = a$; use (22.22), together with the inequalities $T(s) \le L(s)$
and $R(2s) \le 2L(s)$, to prove Etemadi’s inequality (22.10) in the form

(22.23)  $M(3a) \le B_E(a) = 1 \wedge 3L(a)$.

(c) Carry the rightmost term in (22.22) to the left side, take $s = t = a$, and
prove Ottaviani’s inequality:

(22.24)  $M(2a) \le B_O(a) = 1 \wedge \dfrac{T(a)}{1 - R(a)}$.

(d) Prove

$B_E(a) \le 3B_O(a/2)$,  $B_O(a) \le 3B_E(a/6)$.

This shows that the Etemadi and Ottaviani inequalities are of the same power
for most purposes (as, for example, for the proofs of Theorem 22.7 and (37.9)).
Etemadi’s inequality seems the more natural of the two. Neither inequality can
replace (9.39) in the proof of the law of the iterated logarithm.


SECTION 23. THE POISSON PROCESS 


Characterization of the Exponential Distribution 


Suppose that $X$ has the exponential distribution with parameter $\alpha$:

(23.1)  $P[X > x] = e^{-\alpha x}$,  $x \ge 0$.

The definition (4.1) of conditional probability then gives

(23.2)  $P[X > x + y \mid X > x] = P[X > y]$,  $x, y \ge 0$.


Imagine $X$ as the waiting time for the occurrence of some event such as the
arrival of the next customer at a queue or telephone call at an exchange. As
observed in Section 14 (see (14.6)), (23.2) attributes to the waiting-time
mechanism a lack of memory or aftereffect. And as shown in Section 14, the
condition (23.2) implies that $X$ has the distribution (23.1) for some positive
$\alpha$. Thus if in the sense of (23.2) there is no aftereffect in the waiting-time
mechanism, then the waiting time itself necessarily follows the exponential
law.
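
The lack of memory expressed by (23.2) can be checked numerically. The following is only a minimal sketch: the function names and the sample values of the parameter are illustrative assumptions, not part of the text.

```python
import math


def survival(alpha: float, x: float) -> float:
    """P[X > x] for an exponential waiting time with parameter alpha, as in (23.1)."""
    return math.exp(-alpha * x) if x >= 0 else 1.0


def conditional_survival(alpha: float, x: float, y: float) -> float:
    """P[X > x + y | X > x], computed from the definition of conditional probability."""
    return survival(alpha, x + y) / survival(alpha, x)


alpha = 0.7  # illustrative rate, chosen arbitrarily
for x, y in [(0.0, 1.0), (2.5, 1.0), (10.0, 0.3)]:
    # (23.2): having already waited x does not change the law of the remaining wait.
    assert math.isclose(conditional_survival(alpha, x, y), survival(alpha, y))
```

The conditional survival probability does not depend on the elapsed time x, which is exactly the absence of aftereffect.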


The Poisson Process 


Consider next a stream or sequence of events, say arrivals of calls at an
exchange. Let $X_1$ be the waiting time to the first event, let $X_2$ be the waiting
time between the first and second events, and so on. The formal model
consists of an infinite sequence $X_1, X_2, \ldots$ of random variables on some
probability space, and $S_n = X_1 + \cdots + X_n$ represents the time of occurrence
of the $n$th event; it is convenient to write $S_0 = 0$. The stream of events itself
remains intuitive and unformalized, and the mathematical definitions and
arguments are framed in terms of the $X_n$.

If no two of the events are to occur simultaneously, the $S_n$ must be strictly
increasing, and if only finitely many of the events are to occur in each finite
interval of time, $S_n$ must go to infinity:


(23.3)  $0 = S_0(\omega) < S_1(\omega) < S_2(\omega) < \cdots$,  $\sup_n S_n(\omega) = \infty$.


This condition is the same thing as


(23.4)  $X_1(\omega) > 0$, $X_2(\omega) > 0$, $\ldots$,  $\sum_n X_n(\omega) = \infty$.


Throughout the section it will be assumed that these conditions hold everywhere,
that is, for every $\omega$. If they hold only on a set $A$ of probability 1, and if
$X_n(\omega)$ is redefined as $X_n(\omega) = 1$, say, for $\omega \notin A$, then the conditions hold
everywhere and the joint distributions of the $X_n$ and $S_n$ are unaffected.


Condition 0°. For each $\omega$, (23.3) and (23.4) hold.


The arguments go through under the weaker condition that (23.3) and
(23.4) hold with probability 1, but they then involve some fussy and
uninteresting details. There are at the outset no further restrictions on the $X_n$; they
are not assumed independent, for example, or identically distributed.

The number $N_t$ of events that occur in the time interval $[0, t]$ is the largest
integer $n$ such that $S_n \le t$:


(23.5)  $N_t = \max[n\colon S_n \le t]$.


Note that $N_t = 0$ if $t < S_1 = X_1$; in particular, $N_0 = 0$. The number of events
in $(s, t]$ is the increment $N_t - N_s$.
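
The definition (23.5) translates directly into a computation on a finite list of waiting times. This is a minimal sketch under stated assumptions: the helper names and the sample waiting times are hypothetical, and because the list is finite the count is only meaningful for t below the last arrival time.

```python
from bisect import bisect_right
from itertools import accumulate


def arrival_times(waits):
    """S_1, S_2, ... as partial sums of the waiting times X_1, X_2, ..."""
    return list(accumulate(waits))


def count(arrivals, t):
    """N_t = max[n: S_n <= t], with N_t = 0 when t < S_1, as in (23.5)."""
    # bisect_right returns the number of entries that are <= t.
    return bisect_right(arrivals, t)


waits = [0.5, 1.5, 0.25, 2.0]  # hypothetical positive waiting times
S = arrival_times(waits)       # arrival times 0.5, 2.0, 2.25, 4.25

# The number of events in (s, t] is the increment N_t - N_s.
events_in_interval = count(S, 2.25) - count(S, 0.5)
```

Because the arrival times are increasing, a binary search suffices; the relation [N_t >= n] = [S_n <= t] holds by construction.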


[Figure: time axis marked with $S_0 = 0$, $S_1 = X_1$, $S_2 = X_1 + X_2$, $S_3 = X_1 + X_2 + X_3$.]


From (23.5) follows the basic relation connecting the $N_t$ with the $S_n$:

(23.6)  $[N_t \ge n] = [S_n \le t]$.

From this follows

(23.7)  $[N_t = n] = [S_n \le t < S_{n+1}]$.


Each $N_t$ is thus a random variable.

The collection $[N_t\colon t \ge 0]$ is a stochastic process, that is, a collection of
random variables indexed by a parameter regarded as time. Condition 0° can
be restated in terms of this process:


Condition 0°. For each $\omega$, $N_t(\omega)$ is a nonnegative integer for $t \ge 0$, $N_0(\omega) = 0$,
and $\lim_{t \to \infty} N_t(\omega) = \infty$; further, for each $\omega$, $N_t(\omega)$ as a function of $t$ is
nondecreasing and right-continuous, and at the points of discontinuity the saltus
$N_t(\omega) - \sup_{s < t} N_s(\omega)$ is exactly 1.


It is easy to see that (23.3) and (23.4) and the definition (23.5) give random
variables $N_t$ having these properties. On the other hand, if the stochastic
process $[N_t\colon t \ge 0]$ is given and does have these properties, and if random
variables are defined by $S_n(\omega) = \inf[t\colon N_t(\omega) \ge n]$ and $X_n(\omega) = S_n(\omega) -
S_{n-1}(\omega)$, then (23.3) and (23.4) hold, and the definition (23.5) gives back the
original $N_t$. Therefore, anything that can be said about the $X_n$ can be stated
in terms of the $N_t$, and conversely. The points $S_1(\omega), S_2(\omega), \ldots$ of $(0, \infty)$ are
exactly the discontinuities of $N_t(\omega)$ as a function of $t$; because of the
queueing example, it is natural to call them arrival times.

The program is to study the joint distributions of the $N_t$ under conditions
on the waiting times $X_n$ and vice versa. The most common model specifies
the independence of the waiting times and the absence of aftereffect:


Condition 1°. The $X_n$ are independent, and each is exponentially distributed
with parameter $\alpha$.


In this case $P[X_n > 0] = 1$ for each $n$ and $n^{-1}S_n \to \alpha^{-1}$ by the strong law
of large numbers (Theorem 22.1), and so (23.3) and (23.4) hold with
probability 1; to assume they hold everywhere (Condition 0°) is simply a convenient
normalization.

Under Condition 1°, $S_n$ has the distribution function specified by (20.40),
so that $P[N_t \ge n] = \sum_{i=n}^{\infty} e^{-\alpha t}(\alpha t)^i/i!$ by (23.6), and


(23.8)  $P[N_t = n] = e^{-\alpha t}\dfrac{(\alpha t)^n}{n!}$,  $n = 0, 1, \ldots$.


Thus $N_t$ has the Poisson distribution with mean $\alpha t$. More will be proved in a
moment.
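
As an informal check of (23.8), one can simulate independent exponential waiting times and count the arrivals up to time t; a Poisson count has mean and variance both equal to the product of rate and time. This is a sketch only: the function names, the seed, and the parameter values are assumptions made for illustration, and a simulation illustrates the statement rather than proving it.

```python
import random


def simulate_count(alpha, t, rng):
    """One draw of N_t: add independent exponential(alpha) waits until the sum exceeds t."""
    n = 0
    s = rng.expovariate(alpha)  # S_1 = X_1
    while s <= t:
        n += 1
        s += rng.expovariate(alpha)  # next arrival time
    return n


def sample_mean_and_variance(alpha, t, trials, seed=0):
    """Empirical mean and (unbiased) variance of N_t over independent trials."""
    rng = random.Random(seed)
    draws = [simulate_count(alpha, t, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    var = sum((d - mean) ** 2 for d in draws) / (trials - 1)
    return mean, var


# With rate 2 and t = 3, both values should be near 6.
mean, var = sample_mean_and_variance(2.0, 3.0, trials=20000)
```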


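Condition 2°. (i) For $0 < t_1 < \cdots < t_k$ the increments $N_{t_1}, N_{t_2} - N_{t_1}, \ldots,
N_{t_k} - N_{t_{k-1}}$ are independent.
(ii) The individual increments have the Poisson distribution:


(23.9)  $P[N_t - N_s = n] = e^{-\alpha(t-s)}\dfrac{(\alpha(t-s))^n}{n!}$,  $n = 0, 1, \ldots$,  $0 \le s < t$.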


Since $N_0 = 0$, (23.8) is a special case of (23.9). A collection $[N_t\colon t \ge 0]$ of
random variables satisfying Condition 2° is called a Poisson process, and $\alpha$ is
the rate of the process. As the increments are independent by (i), if $r < s < t$,
then the distributions of $N_s - N_r$ and $N_t - N_s$ must convolve to that of
$N_t - N_r$. But the requirement is consistent with (ii) because Poisson
distributions with parameters $u$ and $v$ convolve to a Poisson distribution with
parameter $u + v$.
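
The convolution property invoked here is easy to confirm numerically. A minimal sketch, with hypothetical function names and arbitrary sample parameters:

```python
import math


def poisson_pmf(mean, n):
    """P[N = n] for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean**n / math.factorial(n)


def convolve_at(u, v, n):
    """P[U + V = n] for independent Poisson U and V with means u and v."""
    return sum(poisson_pmf(u, i) * poisson_pmf(v, n - i) for i in range(n + 1))


u, v = 1.2, 2.3  # illustrative means of two adjacent increments
for n in range(12):
    # The convolution agrees with the Poisson distribution with mean u + v.
    assert math.isclose(convolve_at(u, v, n), poisson_pmf(u + v, n))
```

The agreement is an instance of the binomial theorem applied to (u + v)^n.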


Theorem 23.1. Conditions 1° and 2° are equivalent in the presence of 
Condition 0°. 


PROOF OF 1° → 2°. Fix $t$, and consider the events that happen after time
$t$. By (23.5), $S_{N_t} \le t < S_{N_t+1}$, and the waiting time from $t$ to the first event
following $t$ is $S_{N_t+1} - t$; the waiting time between the first and second events
following $t$ is $X_{N_t+2}$; and so on. Thus


(23.10)  $X_1^{(t)} = S_{N_t+1} - t$,  $X_2^{(t)} = X_{N_t+2}$,  $X_3^{(t)} = X_{N_t+3}$,  $\ldots$


define the waiting times following $t$. By (23.6), $N_{t+s} - N_t \ge m$, or $N_{t+s} \ge N_t +
m$, if and only if $S_{N_t+m} \le t + s$, which is the same thing as $X_1^{(t)} + \cdots + X_m^{(t)} \le
s$. Thus


(23.11)  $N_{t+s} - N_t = \max[m\colon X_1^{(t)} + \cdots + X_m^{(t)} \le s]$.


Hence $[N_{t+s} - N_t = m] = [X_1^{(t)} + \cdots + X_m^{(t)} \le s < X_1^{(t)} + \cdots + X_{m+1}^{(t)}]$. A
comparison of (23.11) and (23.5) shows that for fixed $t$ the random variables
$N_{t+s} - N_t$ for $s \ge 0$ are defined in terms of the sequence (23.10) in exactly the
same way as the $N_s$ are defined in terms of the original sequence of waiting
times.
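
The bookkeeping in (23.10) and (23.11) can be traced on a concrete path. The sketch below uses hypothetical helper names and a made-up finite list of waiting times; it assumes the list is long enough that some arrival falls after time t.

```python
from bisect import bisect_right
from itertools import accumulate


def count_from_waits(waits, t):
    """N_t computed from waiting times via (23.5)."""
    return bisect_right(list(accumulate(waits)), t)


def waits_after(waits, t):
    """The sequence (23.10): X_1^(t) = S_{N_t+1} - t, X_2^(t) = X_{N_t+2}, ..."""
    S = list(accumulate(waits))
    n = bisect_right(S, t)  # this is N_t
    if n >= len(waits):
        raise ValueError("not enough waiting times to reach past t")
    # S[n] is S_{N_t+1} (zero-based list), and waits[n + 1:] starts at X_{N_t+2}.
    return [S[n] - t] + list(waits[n + 1:])


waits = [0.5, 1.5, 0.25, 2.0, 1.0]  # hypothetical waiting times
t = 1.0
shifted = waits_after(waits, t)      # waiting times following t

# (23.11): the increment N_{t+s} - N_t is the count built from the shifted sequence.
for s in [0.5, 1.0, 1.25, 3.0, 4.25]:
    increment = count_from_waits(waits, t + s) - count_from_waits(waits, t)
    assert increment == count_from_waits(shifted, s)
```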

The idea now is to show that conditionally on the event $[N_t = n]$ the
random variables (23.10) are independent and exponentially distributed.
Because of the independence of the $X_k$ and the basic property (23.2) of the
exponential distribution, this seems intuitively clear. For a proof, apply
(20.30). Suppose $y \ge 0$; if $G_n$ is the distribution function of $S_n$, then since
$X_{n+1}$ has the exponential distribution,


$P[S_n \le t < S_{n+1},\ S_{n+1} - t > y] = P[S_n \le t,\ X_{n+1} > t + y - S_n]$

  $= \int_{x \le t} P[X_{n+1} > t + y - x]\, dG_n(x)$

  $= e^{-\alpha y} \int_{x \le t} P[X_{n+1} > t - x]\, dG_n(x)$

  $= e^{-\alpha y} P[S_n \le t,\ X_{n+1} > t - S_n]$.


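By the assumed independence of the $X_n$,

$P[S_{n+1} - t > y_1,\ X_{n+2} > y_2, \ldots, X_{n+j} > y_j,\ S_n \le t < S_{n+1}]$
  $= P[S_{n+1} - t > y_1,\ S_n \le t < S_{n+1}]\, e^{-\alpha y_2} \cdots e^{-\alpha y_j}$
  $= P[S_n \le t < S_{n+1}]\, e^{-\alpha y_1} \cdots e^{-\alpha y_j}$.

If $H = (y_1, \infty) \times \cdots \times (y_j, \infty)$, this is


(23.12)  $P[N_t = n,\ (X_1^{(t)}, \ldots, X_j^{(t)}) \in H] = P[N_t = n]\, P[(X_1, \ldots, X_j) \in H]$.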


By Theorem 10.4, the equation extends from $H$ of the special form above to
all $H$ in $\mathcal{R}^j$.

Now the event $[N_{s_i} = m_i,\ 1 \le i \le u]$ can be put in the form $[(X_1, \ldots, X_j) \in
H]$, where $j = m_u + 1$ and $H$ is the set of $x$ in $R^j$ for which $x_1 + \cdots + x_{m_i} \le
s_i < x_1 + \cdots + x_{m_i+1}$, $1 \le i \le u$. But then $[(X_1^{(t)}, \ldots, X_j^{(t)}) \in H]$ is by (23.11)
the same as the event $[N_{t+s_i} - N_t = m_i,\ 1 \le i \le u]$. Thus (23.12) gives


$P[N_t = n,\ N_{t+s_i} - N_t = m_i,\ 1 \le i \le u] = P[N_t = n]\, P[N_{s_i} = m_i,\ 1 \le i \le u]$.


From this it follows by induction on $k$ that if $0 = t_0 < t_1 < \cdots < t_k$, then


(23.13)  $P[N_{t_i} - N_{t_{i-1}} = n_i,\ 1 \le i \le k] = \prod_{i=1}^{k} P[N_{t_i - t_{i-1}} = n_i]$.


Thus Condition 1° implies (23.13) and, as already seen, (23.8). But from
(23.13) and (23.8) follow the two parts of Condition 2°. ∎


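PROOF OF 2° → 1°. If 2° holds, then by (23.6), $P[X_1 > t] = P[N_t = 0] =
e^{-\alpha t}$, so that $X_1$ is exponentially distributed. To find the joint distribution of
$S_1$ and $S_2$, suppose that $0 \le s_1 < t_1 < s_2 < t_2$ and perform the calculation


$P[s_1 < S_1 \le t_1,\ s_2 < S_2 \le t_2]$
  $= P[N_{s_1} = 0,\ N_{t_1} - N_{s_1} = 1,\ N_{s_2} - N_{t_1} = 0,\ N_{t_2} - N_{s_2} \ge 1]$

  $= e^{-\alpha s_1} \times \alpha(t_1 - s_1)e^{-\alpha(t_1 - s_1)} \times e^{-\alpha(s_2 - t_1)} \times (1 - e^{-\alpha(t_2 - s_2)})$

  $= \alpha(t_1 - s_1)(e^{-\alpha s_2} - e^{-\alpha t_2}) = \iint_{s_1 < y_1 \le t_1,\ s_2 < y_2 \le t_2} \alpha^2 e^{-\alpha y_2}\, dy_1\, dy_2$.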

Thus for a rectangle $A$ contained in the open set $G = [(y_1, y_2)\colon 0 < y_1 < y_2]$,

$P[(S_1, S_2) \in A] = \int_A \alpha^2 e^{-\alpha y_2}\, dy_1\, dy_2$.


By inclusion-exclusion, this holds for finite unions of such rectangles and
hence, by a passage to the limit, for countable ones. Therefore, it holds for
$A = G \cap G'$ if $G'$ is open. Since the open sets form a $\pi$-system generating the
Borel sets, $(S_1, S_2)$ has density $\alpha^2 e^{-\alpha y_2}$ on $G$ (of course, the density is 0
outside $G$).

By a similar argument in $R^k$ (the notation only is more complicated),
$(S_1, \ldots, S_k)$ has density $\alpha^k e^{-\alpha y_k}$ on $[y\colon 0 < y_1 < \cdots < y_k]$. If a linear
transformation $g(y) = x$ is defined by $x_i = y_i - y_{i-1}$ (with $y_0 = 0$), then $(X_1, \ldots, X_k) =
g(S_1, \ldots, S_k)$ has by (20.20) the density $\prod_{i=1}^{k} \alpha e^{-\alpha x_i}$ (the Jacobian is
identically 1). This proves Condition 1°. ∎


The Poisson Approximation 


Other characterizations of the Poisson process depend on a generalization of 
the classical Poisson approximation to the binomial distribution. 


Theorem 23.2. Suppose that for each $n$, $Z_{n1}, \ldots, Z_{nr_n}$ are independent
random variables and $Z_{nk}$ assumes the values 1 and 0 with probabilities $p_{nk}$
and $1 - p_{nk}$. If


(23.14)  $\sum_{k=1}^{r_n} p_{nk} \to \lambda \ge 0$,  $\max_{1 \le k \le r_n} p_{nk} \to 0$,

then

(23.15)  $P\left[\sum_{k=1}^{r_n} Z_{nk} = i\right] \to e^{-\lambda}\dfrac{\lambda^i}{i!}$,  $i = 0, 1, 2, \ldots$.


If $\lambda = 0$, the limit in (23.15) is interpreted as 1 for $i = 0$ and 0 for $i \ge 1$. In
the case where $r_n = n$ and $p_{nk} = \lambda/n$, (23.15) is the Poisson approximation to
the binomial. Note that if $\lambda > 0$, then (23.14) implies $r_n \to \infty$.


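[Diagram: the unit interval split into $I_0$ (length $1 - p$) and $I_1$ (length $p$), and into $J_0$ (length $e^{-p}$), $J_1$ (length $e^{-p}p/1!$), $J_2$, $\ldots$]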


PROOF. The argument depends on a construction like that in the proof of
Theorem 20.4. Let $U_1, U_2, \ldots$ be independent random variables, each
uniformly distributed over $[0, 1)$. For each $p$, $0 \le p \le 1$, split $[0, 1)$ into the two
intervals $I_0(p) = [0, 1 - p)$ and $I_1(p) = [1 - p, 1)$, as well as into the sequence
of intervals $J_i(p) = \left[\sum_{j<i} e^{-p}p^j/j!,\ \sum_{j \le i} e^{-p}p^j/j!\right)$, $i = 0, 1, \ldots$. Define $V_{nk} = 1$
if $U_k \in I_1(p_{nk})$ and $V_{nk} = 0$ if $U_k \in I_0(p_{nk})$. Then $V_{n1}, \ldots, V_{nr_n}$ are
independent, and $V_{nk}$ assumes the values 1 and 0 with probabilities $P[U_k \in I_1(p_{nk})] =
p_{nk}$ and $P[U_k \in I_0(p_{nk})] = 1 - p_{nk}$. Since $V_{n1}, \ldots, V_{nr_n}$ have the same joint
distribution as $Z_{n1}, \ldots, Z_{nr_n}$, (23.15) will follow if it is shown that $V_n = \sum_{k=1}^{r_n} V_{nk}$
satisfies


(23.16)  $P[V_n = i] \to e^{-\lambda}\dfrac{\lambda^i}{i!}$.


Now define $W_{nk} = i$ if $U_k \in J_i(p_{nk})$, $i = 0, 1, \ldots$. Then $P[W_{nk} = i] =
e^{-p_{nk}}p_{nk}^i/i!$; that is, $W_{nk}$ has the Poisson distribution with mean $p_{nk}$. Since the $W_{nk}$
are independent, $W_n = \sum_{k=1}^{r_n} W_{nk}$ has the Poisson distribution with mean
$\lambda_n = \sum_{k=1}^{r_n} p_{nk}$. Since $1 - p \le e^{-p}$, $J_1(p) \subset I_1(p)$ (see the diagram). Therefore,


$P[V_{nk} \ne W_{nk}] = P[V_{nk} = 1 \ne W_{nk}] = P[U_k \in I_1(p_{nk}) - J_1(p_{nk})]$
  $= p_{nk} - e^{-p_{nk}}p_{nk} \le p_{nk}^2$,


and


$P[V_n \ne W_n] \le \sum_{k=1}^{r_n} p_{nk}^2 \le \lambda_n \max_{1 \le k \le r_n} p_{nk} \to 0$


by (23.14). And now (23.16) and (23.15) follow because


$P[W_n = i] = e^{-\lambda_n}\lambda_n^i/i! \to e^{-\lambda}\lambda^i/i!$. ∎
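
The coupling bound in this proof can be observed numerically: the exact law of a sum of independent 0/1 variables differs from the Poisson law with the same mean by at most the sum of the squared success probabilities, uniformly in i. The sketch below is illustrative only; the function names and the particular probabilities are assumptions, and the Poisson mean is assumed positive.

```python
import math


def bernoulli_sum_pmf(ps):
    """Exact law of Z_1 + ... + Z_r for independent 0/1 variables with P[Z_k = 1] = ps[k]."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for i, q in enumerate(pmf):
            nxt[i] += q * (1 - p)   # Z_k = 0
            nxt[i + 1] += q * p     # Z_k = 1
        pmf = nxt
    return pmf


def poisson_pmf(mean, i):
    """Poisson probability via logarithms (assumes mean > 0) to avoid huge factorials."""
    return math.exp(-mean + i * math.log(mean) - math.lgamma(i + 1))


def max_gap(ps):
    """Largest |P[V_n = i] - P[W_n = i]| over i = 0, ..., r, with lambda_n = sum(ps)."""
    lam_n = sum(ps)
    return max(abs(q - poisson_pmf(lam_n, i)) for i, q in enumerate(bernoulli_sum_pmf(ps)))


ps = [2.0 / 50] * 50             # binomial case: r_n = n = 50, p_nk = lambda/n, lambda = 2
bound = sum(p * p for p in ps)   # sum of p_nk^2, which dominates P[V_n != W_n]
assert max_gap(ps) <= bound
```

Increasing n with the total mean held fixed shrinks the bound, which is the content of (23.14) and (23.15) in the binomial case.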


Other Characterizations of the Poisson Process 


The condition (23.2) is an interesting characterization of the exponential 
distribution because it is essentially qualitative. There are qualitative charac- 
terizations of the Poisson process as well. 

For each $\omega$, the function $N_t(\omega)$ has a discontinuity at $t$ if and only if
$S_n(\omega) = t$ for some $n \ge 1$; $t$ is a fixed discontinuity if the probability of this is
positive. The condition that there be no fixed discontinuities is therefore


(23.17)  $P[S_n = t] = 0$,  $t \ge 0$, $n \ge 1$;


that is, each of $S_1, S_2, \ldots$ has a continuous distribution function. Of course
there is probability 1 (under Condition 0°) that $N_t(\omega)$ has a discontinuity
somewhere (and indeed has infinitely many of them). But (23.17) ensures that
a $t$ specified in advance has probability 0 of being a discontinuity, or time of
an arrival. The Poisson process satisfies this natural condition.


Theorem 23.3. If Condition 0° holds and $[N_t\colon t \ge 0]$ has independent
increments and no fixed discontinuities, then each increment has a Poisson
distribution.

This is Prékopa’s theorem. The conclusion is not that $[N_t\colon t \ge 0]$ is a
Poisson process, because the mean of $N_t - N_s$ need not be proportional to
$t - s$. If $\varphi$ is an arbitrary nondecreasing, continuous function on $[0, \infty)$ with
$\varphi(0) = 0$, and if $[N_t\colon t \ge 0]$ is a Poisson process, then $N_{\varphi(t)}$ satisfies the
conditions of the theorem.†


PROOF. The problem is to show for $t' < t''$ that $N_{t''} - N_{t'}$ has for some
$\lambda \ge 0$ a Poisson distribution with mean $\lambda$, a unit mass at 0 being regarded as
a Poisson distribution with mean 0.


†This is in fact the general process satisfying them; see Problem 23.8.


The procedure is to construct a sequence of partitions

(23.18)  $t' = t_{n0} < t_{n1} < \cdots < t_{nr_n} = t''$


of $[t', t'']$ with three properties. First, each decomposition refines the
preceding one: each $t_{nk}$ is a $t_{n+1,j}$. Second,


(23.19)  $\sum_{k=1}^{r_n} P[N_{t_{nk}} - N_{t_{n,k-1}} \ge 1] \uparrow \lambda$


for some finite $\lambda$ and


(23.20)  $\max_{1 \le k \le r_n} P[N_{t_{nk}} - N_{t_{n,k-1}} \ge 1] \to 0$.

Third,

(23.21)  $P\left[\max_{1 \le k \le r_n} (N_{t_{nk}} - N_{t_{n,k-1}}) \ge 2\right] \to 0$.

Once the partitions have been constructed, the rest of the proof is easy:
Let $Z_{nk}$ be 1 or 0 according as $N_{t_{nk}} - N_{t_{n,k-1}}$ is positive or not. Since $[N_t\colon
t \ge 0]$ has independent increments, the $Z_{nk}$ are independent for each $n$. By
Theorem 23.2, therefore, (23.19) and (23.20) imply that $Z_n = \sum_{k=1}^{r_n} Z_{nk}$
satisfies $P[Z_n = i] \to e^{-\lambda}\lambda^i/i!$. Now $N_{t''} - N_{t'} \ge Z_n$, and there is strict inequality if
and only if $N_{t_{nk}} - N_{t_{n,k-1}} \ge 2$ for some $k$. Thus (23.21) implies $P[N_{t''} - N_{t'} \ne
Z_n] \to 0$, and therefore $P[N_{t''} - N_{t'} = i] = e^{-\lambda}\lambda^i/i!$.

To construct the partitions, consider for each $t$ the distance $D_t = \inf_{m \ge 1}
|t - S_m|$ from $t$ to the nearest arrival time. Since $S_m \to \infty$, the infimum is
achieved. Further, $D_t = 0$ if and only if $S_m = t$ for some $m$, and since by
hypothesis there are no fixed discontinuities, the probability of this is 0:
$P[D_t = 0] = 0$. Choose $\delta_t$ so that $0 < \delta_t < n^{-1}$ and $P[D_t < \delta_t] < n^{-1}$. The
intervals $(t - \delta_t, t + \delta_t)$ for $t' \le t \le t''$ cover $[t', t'']$. Choose a finite subcover,
and in (23.18) take the $t_{nk}$ for $0 < k < r_n$ to be the endpoints (of intervals in
the subcover) that are contained in $(t', t'')$. By the construction,


(23.22)  $\max_{1 \le k \le r_n} (t_{nk} - t_{n,k-1}) \to 0$,


and the probability that $(t_{n,k-1}, t_{nk}]$ contains some $S_m$ is less than $n^{-1}$. This
gives a sequence of partitions satisfying (23.20). Inserting more points in a
partition cannot increase the maxima in (23.20) and (23.22), and so it can be
arranged that each partition refines the preceding one.

To prove (23.21) it is enough (Theorem 4.1) to show that the limit superior
of the sets involved has probability 0. It is in fact empty: If for infinitely many
$n$, $N_{t_{nk}}(\omega) - N_{t_{n,k-1}}(\omega) \ge 2$ holds for some $k \le r_n$, then by (23.22), $N_t(\omega)$ as a
function of $t$ has in $[t', t'']$ discontinuity points (arrival times) arbitrarily close
together, which requires $S_m(\omega) \in [t', t'']$ for infinitely many $m$, in violation of
Condition 0°.

It remains to prove (23.19). If $Z_{nk}$ and $Z_n$ are defined as above and
$p_{nk} = P[Z_{nk} = 1]$, then the sum in (23.19) is $\sum_k p_{nk} = E[Z_n]$. Since $Z_{n+1} \ge Z_n$,
$\sum_k p_{nk}$ is nondecreasing in $n$. Now


$P[N_{t''} - N_{t'} = 0] = P[Z_{nk} = 0,\ k \le r_n] = \prod_{k=1}^{r_n} (1 - p_{nk}) \le \exp\left[-\sum_{k=1}^{r_n} p_{nk}\right]$.


If the left-hand side here is positive, this puts an upper bound on $\sum_k p_{nk}$, and
(23.19) follows. But suppose $P[N_{t''} - N_{t'} = 0] = 0$. If $s$ is the midpoint of $t'$
and $t''$, then since the increments are independent, one of $P[N_s - N_{t'} = 0]$
and $P[N_{t''} - N_s = 0]$ must vanish. It is therefore possible to find a nested
sequence of intervals $[u_m, v_m]$ such that $v_m - u_m \to 0$ and the event $A_m =
[N_{v_m} - N_{u_m} \ge 1]$ has probability 1. But then $P(\bigcap_m A_m) = 1$, and if $t$ is the
point common to the $[u_m, v_m]$, there is an arrival at $t$ with probability 1,
contrary to the assumption that there are no fixed discontinuities. ∎


Theorem 23.3 in some cases makes the Poisson model quite plausible. The
increments will be essentially independent if the arrivals to time $s$ cannot
seriously deplete the population of potential arrivals, so that $N_s$ has for $t > s$
negligible effect on $N_t - N_s$. And the condition that there are no fixed
discontinuities is entirely natural. These conditions hold for arrivals of calls
at a telephone exchange if the rate of calls is small in comparison with the
population of subscribers and calls are not placed at fixed, predetermined
times. If the arrival rate is essentially constant, this leads to the following
condition.


Condition 3°. (i) For $0 < t_1 < \cdots < t_k$ the increments $N_{t_1}, N_{t_2} - N_{t_1}, \ldots,
N_{t_k} - N_{t_{k-1}}$ are independent.


(ii) The distribution of $N_t - N_s$ depends only on the difference $t - s$.


Theorem 23.4. Conditions 1°, 2°, and 3° are equivalent in the presence of 
Condition 0°. 


PROOF. Obviously Condition 2° implies 3°. Suppose that Condition 3°
holds. If $J_t$ is the saltus at $t$ ($J_t = N_t - \sup_{s<t} N_s$), then $[N_t - N_{t-n^{-1}} \ge 1] \downarrow
[J_t \ge 1]$, and it follows by (ii) of Condition 3° that $P[J_t \ge 1]$ is the same for all
$t$. But if the value common to $P[J_t \ge 1]$ is positive, then by the independence
of the increments and the second Borel–Cantelli lemma there is probability 1
that $J_t \ge 1$ for infinitely many rational $t$ in $(0, 1)$, for example, which
contradicts Condition 0°.

By Theorem 23.3, then, the increments have Poisson distributions. If $f(t)$
is the mean of $N_t$, then $N_t - N_s$ for $s < t$ must have mean $f(t) - f(s)$ and
must by (ii) have mean $f(t - s)$; thus $f(t) = f(s) + f(t - s)$. Therefore, $f$
satisfies Cauchy’s functional equation [A20] and, being nondecreasing, must
have the form $f(t) = \alpha t$ for $\alpha \ge 0$. Condition 0° makes $\alpha = 0$ impossible. ∎


One standard way of deriving the Poisson process is by differential
equations.


Condition 4°. If $0 < t_1 < \cdots < t_k$ and if $n_1, \ldots, n_k$ are nonnegative integers,
then

(23.23)  $P[N_{t_k+h} - N_{t_k} = 1 \mid N_{t_j} = n_j,\ j \le k] = \alpha h + o(h)$

and

(23.24)  $P[N_{t_k+h} - N_{t_k} \ge 2 \mid N_{t_j} = n_j,\ j \le k] = o(h)$


as $h \downarrow 0$. Moreover, $[N_t\colon t \ge 0]$ has no fixed discontinuities.


The occurrences of $o(h)$ in (23.23) and (23.24) denote functions, say $\phi_1(h)$
and $\phi_2(h)$, such that $h^{-1}\phi_i(h) \to 0$ as $h \downarrow 0$; the $\phi_i$ may depend a priori on
$k$, $t_1, \ldots, t_k$, and $n_1, \ldots, n_k$ as well as on $h$. It is assumed in (23.23) and
(23.24) that the conditioning events have positive probability, so that the
conditional probabilities are well defined.


Theorem 23.5. Conditions 1° through 4° are all equivalent in the presence 
of Condition 0°. 


PROOF OF 2° → 4°. For a Poisson process with rate $\alpha$, the left-hand sides
of (23.23) and (23.24) are $e^{-\alpha h}\alpha h$ and $1 - e^{-\alpha h} - e^{-\alpha h}\alpha h$, and these are
$\alpha h + o(h)$ and $o(h)$, respectively, because $e^{-\alpha h} = 1 - \alpha h + o(h)$. And by the
argument in the preceding proof, the process has no fixed discontinuities.
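
The two o(h) claims can be seen numerically by dividing the error terms by h and letting h shrink. A small sketch with hypothetical function names and an arbitrary rate:

```python
import math


def one_event(alpha, h):
    """P[exactly one event in an interval of length h] for a Poisson process of rate alpha."""
    return math.exp(-alpha * h) * alpha * h


def two_or_more(alpha, h):
    """P[two or more events in an interval of length h]."""
    # -expm1(-x) computes 1 - exp(-x) without cancellation for small x.
    return -math.expm1(-alpha * h) - math.exp(-alpha * h) * alpha * h


alpha = 2.0  # illustrative rate
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    # Both ratios tend to 0 as h decreases, as (23.23) and (23.24) require.
    ratio_one = abs(one_event(alpha, h) - alpha * h) / h
    ratio_two = two_or_more(alpha, h) / h
    print(h, ratio_one, ratio_two)
```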


PROOF OF 4° → 2°. Fix $k$, the $t_j$, and the $n_j$; denote by $A$ the event
$[N_{t_j} = n_j,\ j \le k]$, and for $t \ge 0$ put $p_n(t) = P[N_{t_k+t} - N_{t_k} = n \mid A]$. It will be
shown that


(23.25)  $p_n(t) = e^{-\alpha t}\dfrac{(\alpha t)^n}{n!}$,  $n = 0, 1, \ldots$.


This will also be proved for the case in which $p_n(t) = P[N_t = n]$. Condition 2°
will then follow by induction.

If $t > 0$ and $|t - s| < n^{-1}$, then


$|P[N_t = n] - P[N_s = n]| \le P[N_t \ne N_s] \le P[N_{t+n^{-1}} - N_{t-n^{-1}} \ge 1]$.


As $n \uparrow \infty$, the right side here decreases to the probability of a discontinuity
at $t$, which is 0 by hypothesis. Thus $P[N_t = n]$ is continuous at $t$. The same
kind of argument works for conditional probabilities and for $t = 0$, and so
$p_n(t)$ is continuous for $t \ge 0$.

To simplify the notation, put $D_t = N_{t_k+t} - N_{t_k}$. If $D_{t+h} = n$, then $D_t = m$
for some $m \le n$. If $t > 0$, then by the rules for conditional probabilities,


$p_n(t + h) = p_n(t)\, P[D_{t+h} - D_t = 0 \mid A \cap [D_t = n]]$
  $+\ p_{n-1}(t)\, P[D_{t+h} - D_t = 1 \mid A \cap [D_t = n - 1]]$
  $+\ \sum_{m=0}^{n-2} p_m(t)\, P[D_{t+h} - D_t = n - m \mid A \cap [D_t = m]]$.


For $n \le 1$, the final sum is absent, and for $n = 0$, the middle term is absent as
well. This holds in the case $p_n(t) = P[N_t = n]$ if $D_t = N_t$ and $A = \Omega$. (If $t = 0$,
some of the conditioning events here are empty; hence the assumption $t > 0$.)
By (23.24), the final sum is $o(h)$ for each fixed $n$. Applying (23.23) and (23.24)
now leads to


$p_n(t + h) = p_n(t)(1 - \alpha h) + p_{n-1}(t)\alpha h + o(h)$,

and letting $h \downarrow 0$ gives

(23.26)  $p_n'(t) = -\alpha p_n(t) + \alpha p_{n-1}(t)$.


In the case $n = 0$, take $p_{-1}(t)$ to be identically 0. In (23.26), $t > 0$ and $p_n'(t)$ is
a right-hand derivative. But since $p_n(t)$ and the right side of the equation are
continuous on $[0, \infty)$, (23.26) holds also for $t = 0$ and $p_n'(t)$ can be taken as a
two-sided derivative for $t > 0$ [A22].

Now (23.26) gives [A23]


$p_n(t) = e^{-\alpha t}\left[p_n(0) + \alpha \int_0^t p_{n-1}(s)e^{\alpha s}\, ds\right]$.


Since $p_n(0)$ is 1 or 0 as $n = 0$ or $n > 0$, (23.25) follows by induction on $n$. ∎
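
The system (23.26) can also be integrated numerically and compared with the closed form (23.25). The following is a rough sketch using the forward Euler method, which is a choice made here for simplicity and not something the text prescribes; the function names, step count, and parameter values are likewise assumptions.

```python
import math


def solve_poisson_odes(alpha, t_end, n_max, steps=20000):
    """Forward Euler for p_n'(t) = -alpha*p_n(t) + alpha*p_{n-1}(t), p_{-1} = 0, p_0(0) = 1."""
    h = t_end / steps
    p = [1.0] + [0.0] * n_max  # initial condition: p_n(0) is 1 for n = 0, else 0
    for _ in range(steps):
        prev = p[:]
        for n in range(n_max + 1):
            inflow = prev[n - 1] if n > 0 else 0.0
            p[n] = prev[n] + h * (-alpha * prev[n] + alpha * inflow)
    return p


def poisson_pmf(mean, n):
    """The right side of (23.25) with mean = alpha * t."""
    return math.exp(-mean) * mean**n / math.factorial(n)


alpha, t_end = 1.5, 2.0  # illustrative values
numeric = solve_poisson_odes(alpha, t_end, n_max=8)
worst = max(abs(numeric[n] - poisson_pmf(alpha * t_end, n)) for n in range(9))
```

With these settings the discrepancy is small; it shrinks further as the number of steps grows.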


Stochastic Processes 


The Poisson process $[N_t\colon t \ge 0]$ is one example of a stochastic process, that
is, a collection of random variables (on some probability space $(\Omega, \mathcal{F}, P)$)
indexed by a parameter regarded as representing time. In the Poisson case,
time is continuous. In some cases the time is discrete: Section 7 concerns the
sequence $\{F_n\}$ of a gambler’s fortunes; there $n$ represents time, but time that
increases in jumps.

Part of the structure of a stochastic process is specified by its
finite-dimensional distributions. For any finite sequence $t_1, \ldots, t_k$ of time points, the
$k$-dimensional random vector $(N_{t_1}, \ldots, N_{t_k})$ has a distribution $\mu_{t_1 \cdots t_k}$ over $R^k$.
These measures $\mu_{t_1 \cdots t_k}$ are the finite-dimensional distributions of the
process. Condition 2° of this section in effect specifies them for the Poisson case:


(23.27)  $P[N_{t_j} = n_j,\ j \le k] = \prod_{j=1}^{k} e^{-\alpha(t_j - t_{j-1})}\dfrac{(\alpha(t_j - t_{j-1}))^{n_j - n_{j-1}}}{(n_j - n_{j-1})!}$


if $0 \le n_1 \le \cdots \le n_k$ and $0 < t_1 < \cdots < t_k$ (take $n_0 = t_0 = 0$).

The finite-dimensional distributions do not, however, contain all the
mathematically interesting information about the process in the case of
continuous time. Because of (23.3), (23.4), and the definition (23.5), for each
fixed $\omega$, $N_t(\omega)$ as a function of $t$ has the regularity properties given in the
second version of Condition 0°. These properties are used in an essential way
in the proofs.

Suppose that $f(t)$ is $t$ or 0 according as $t$ is rational or irrational. Let $N_t$
be defined as before, and let


(23.28)  $M_t(\omega) = N_t(\omega) + f(t + X_1(\omega))$.


If $R$ is the set of rationals, then $P[\omega\colon f(t + X_1(\omega)) \ne 0] = P[\omega\colon X_1(\omega) \in
R - t] = 0$ for each $t$ because $R - t$ is countable and $X_1$ has a density. Thus
$P[M_t = N_t] = 1$ for each $t$, and so the stochastic process $[M_t\colon t \ge 0]$ has the
same finite-dimensional distributions as $[N_t\colon t \ge 0]$. For $\omega$ fixed, however,
$M_t(\omega)$ as a function of $t$ is everywhere discontinuous and is neither
monotone nor exclusively integer-valued.

The functions obtained by fixing w and letting ¢ vary are called the path 
functions or sample paths of the process. The example above shows that the 
finite-dimensional distributions do not suffice to determine the character of 
the path functions. In specifying a stochastic process as a model for some 
phenomenon, it is natural to place conditions on the character of the sample 
paths as well as on the finite-dimensional distributions. Condition 0° was 
imposed throughout this section to ensure that the sample paths are
nondecreasing, right-continuous, integer-valued step functions, a natural condition
if $N_t$ is to represent the number of events in $[0, t]$. Stochastic processes in
continuous time are studied further in Chapter 7.


PROBLEMS


Assume the Poisson processes here satisfy Condition 0° as well as Condition 1°.


23.1. Show that the minimum of independent exponential waiting times is again
exponential and that the parameters add.


23.2. 20.17↑ Show that the time $S_n$ of the $n$th event in a Poisson stream has the
gamma density $f(x; \alpha, n)$ as defined by (20.47). This is sometimes called the
Erlang density.


23.3. Let $A_t = t - S_{N_t}$ be the time back to the most recent event in the Poisson
stream (or to 0), and let $B_t = S_{N_t+1} - t$ be the time forward to the next event.
Show that $A_t$ and $B_t$ are independent, that $B_t$ is distributed as $X_1$
(exponentially with parameter $\alpha$), and that $A_t$ is distributed as $\min\{X_1, t\}$: $P[A_t \le x]$ is
0, $1 - e^{-\alpha x}$, or 1 as $x < 0$, $0 \le x < t$, or $x \ge t$.


23.4. ↑ Let $L_t = A_t + B_t = S_{N_t+1} - S_{N_t}$ be the length of the interarrival interval
covering $t$.

(a) Show that $L_t$ has density

$d_t(x) = \alpha^2 x e^{-\alpha x}$ if $0 < x < t$,
$d_t(x) = \alpha(1 + \alpha t)e^{-\alpha x}$ if $x \ge t$.

(b) Show that $E[L_t]$ converges to $2E[X_1]$ as $t \to \infty$. This seems paradoxical
because $L_t$ is one of the $X_n$. Give an intuitive resolution of the apparent
paradox.


23.5. Merging Poisson streams. Define a process $\{N_t\}$ by (23.5) for a sequence $\{X_n\}$ of
random variables satisfying (23.4). Let $\{X_n'\}$ be a second sequence of random
variables, on the same probability space, satisfying (23.4), and define $\{N_t'\}$ by
$N_t' = \max[n\colon X_1' + \cdots + X_n' \le t]$. Define $\{N_t''\}$ by $N_t'' = N_t + N_t'$. Show that, if
$\sigma(X_1, X_2, \ldots)$ and $\sigma(X_1', X_2', \ldots)$ are independent and $\{N_t\}$ and $\{N_t'\}$ are
Poisson processes with respective rates $\alpha$ and $\beta$, then $\{N_t''\}$ is a Poisson process
with rate $\alpha + \beta$.


23.6. ↑ The $n$th and $(n + 1)$st events in the process $\{N_t\}$ occur at times $S_n$ and
$S_{n+1}$.

(a) Find the distribution of the number $N'_{S_{n+1}} - N'_{S_n}$ of events in the other
process during this time interval.

(b) Generalize to $N'_{S_m} - N'_{S_n}$.


23.7. Suppose that $X_1, X_2, \ldots$ are independent and exponentially distributed with
parameter $\alpha$, so that (23.5) defines a Poisson process $\{N_t\}$. Suppose that
$Y_1, Y_2, \ldots$ are independent and identically distributed and that $\sigma(X_1, X_2, \ldots)$
and $\sigma(Y_1, Y_2, \ldots)$ are independent. Put $Z_t = \sum_{k \le N_t} Y_k$. This is the compound
Poisson process. If, for example, the event at time $S_n$ in the original process
represents an insurance claim, and if $Y_n$ represents the amount of the claim,
then $Z_t$ represents the total claims to time $t$.

(a) If $Y_k = 1$ with probability 1, then $\{Z_t\}$ is an ordinary Poisson process.

(b) Show that $\{Z_t\}$ has independent increments and that $Z_{s+t} - Z_s$ has the
same distribution as $Z_t$.

(c) Show that, if $Y_k$ assumes the values 1 and 0 with probabilities $p$ and $1 - p$
($0 < p < 1$), then $\{Z_t\}$ is a Poisson process with rate $p\alpha$.


23.8. Suppose a process satisfies Condition 0° and has independent, Poisson-distributed
increments and no fixed discontinuities. Show that it has the form $\{N_{\varphi(t)}\}$,
where $\{N_t\}$ is a standard Poisson process and $\varphi$ is a nondecreasing, continuous
function on $[0, \infty)$ with $\varphi(0) = 0$.


23.9. If the waiting times $X_n$ are independent and exponentially distributed with
parameter $\alpha$, then $S_n/n \to \alpha^{-1}$ with probability 1, by the strong law of large
numbers. From $\lim_{t \to \infty} N_t = \infty$ and $S_{N_t} \le t < S_{N_t+1}$ deduce that $\lim_{t \to \infty} N_t/t =
\alpha$ with probability 1.


23.10. ↑ (a) Suppose that $X_1, X_2, \ldots$ are positive, and assume directly that
$S_n/n \to m$ with probability 1, as happens if the $X_n$ are independent and
identically distributed with mean $m$. Show that $\lim_t N_t/t = 1/m$ with
probability 1.

(b) Suppose now that $S_n/n \to \infty$ with probability 1, as happens if the $X_n$ are
independent and identically distributed and have infinite mean. Show that
$\lim_t N_t/t = 0$ with probability 1.


The results in Problem 23.10 are theorems in renewal theory: A component of
some mechanism is replaced each time it fails or wears out. The $X_n$ are the lifetimes
of the successive components, and $N_t$ is the number of replacements, or renewals, to
time $t$.


23.11. 20.7 23.10↑ Consider a persistent, irreducible Markov chain, and for a fixed
state $j$ let $N_n$ be the number of passages through $j$ up to time $n$. Show that
$N_n/n \to 1/m$ with probability 1, where $m = \sum_{k=1}^{\infty} k f_{jj}^{(k)}$ is the mean return time
(replace $1/m$ by 0 if this mean is infinite). See Lemma 3 in Section 8.


23.12. Suppose that $X$ and $Y$ have Poisson distributions with parameters $\alpha$ and $\beta$.
Show that $|P[X = i] - P[Y = i]| \le |\alpha - \beta|$. Hint: Suppose that $\alpha < \beta$, and
represent $Y$ as $X + D$, where $X$ and $D$ are independent and have Poisson
distributions with parameters $\alpha$ and $\beta - \alpha$.


23.13. ↑ Use the methods in the proof of Theorem 23.2 to show that the error in
(23.15) is bounded uniformly in $i$ by $|\lambda - \lambda_n| + \lambda_n \max_k p_{nk}$.


SECTION 24. THE ERGODIC THEOREM * 


Even though chance necessarily involves the notion of change, the laws
governing the change may themselves remain constant as time passes: If time
does not alter the roulette wheel, the gambler’s fortunes fluctuate according
to constant probability laws. The ergodic theorem is a version of the strong
law of large numbers general enough to apply to any system governed by
probability laws that are invariant in time.


*This section may be omitted. There is more on ergodic theory in Section 36.


Measure-Preserving Transformations 


Let $(\Omega, \mathcal{F}, P)$ be a probability space. A mapping $T\colon \Omega \to \Omega$ is a
measure-preserving transformation if it is measurable $\mathcal{F}/\mathcal{F}$ and $P(T^{-1}A) = P(A)$ for all
$A$ in $\mathcal{F}$, from which it follows that $P(T^{-n}A) = P(A)$ for $n \ge 0$. If, further, $T$
is a one-to-one mapping onto $\Omega$ and the point mapping $T^{-1}$ is measurable
$\mathcal{F}/\mathcal{F}$ ($TA \in \mathcal{F}$ for $A \in \mathcal{F}$), then $T$ is invertible; in this case $T^{-1}$
automatically preserves measure: $P(A) = P(T^{-1}TA) = P(TA)$.

The first result is a simple consequence of the $\pi$–$\lambda$ theorem (or Theorems
13.1(i) and 3.3).


Lemma 1. If # is a m-system generating F, and if T 'AE F and 
P(T~'A)=P(A) for A € FY, then T is a measure-preserving transformation. 


Example 24.1. The Bernoulli shift. Let $S$ be a finite set, and consider the
space $S^\infty$ of sequences (2.15) of elements of $S$. Define the shift $T$ by

(24.1)    $T\omega = (z_2(\omega), z_3(\omega), \ldots);$

the first element of $\omega = (z_1(\omega), z_2(\omega), \ldots)$ is lost, and
$T$ shifts the remaining elements one place to the left:
$z_k(T\omega) = z_{k+1}(\omega)$ for $k \ge 1$. If
$A = [\omega\colon (z_1(\omega), \ldots, z_n(\omega)) \in H]$ is a cylinder
(2.17), then

(24.2)    $T^{-1}A = [\omega\colon (z_2(\omega), \ldots, z_{n+1}(\omega)) \in H]
                   = [\omega\colon (z_1(\omega), \ldots, z_{n+1}(\omega)) \in S \times H]$

is another cylinder, and since the cylinders generate the basic $\sigma$-field
$\mathcal{C}$, $T$ is measurable $\mathcal{C}/\mathcal{C}$.

For probabilities $p_u$ on $S$ (nonnegative and summing to 1), define product
measure $P$ on the field $\mathcal{C}_0$ of cylinders by (2.21). Then $P$ is
consistently defined and countably additive (Theorem 2.3) and hence extends to
a probability measure on $\mathcal{C} = \sigma(\mathcal{C}_0)$. Since the thin
cylinders (2.16) form a $\pi$-system that generates $\mathcal{C}$, $P$ is
completely determined by the probabilities it assigns to them:

(24.3)    $P[\omega\colon (z_1(\omega), \ldots, z_n(\omega)) = (u_1, \ldots, u_n)]
              = p_{u_1} \cdots p_{u_n}.$

If $A$ is the thin cylinder on the left here, then by (24.2),

(24.4)    $P(T^{-1}A) = \sum_{u \in S} p_u p_{u_1} \cdots p_{u_n}
              = p_{u_1} \cdots p_{u_n} = P(A),$

and it follows by the lemma that $T$ preserves $P$. This $T$ is the Bernoulli
shift.

If $A = [\omega\colon z_1(\omega) = u]$ ($u \in S$), then
$I_A(T^{k-1}\omega)$ is 1 or 0 according as $z_k(\omega)$ is $u$ or not. Since
by (24.3) the events $[\omega\colon z_k(\omega) = u]$ are independent, each
with probability $p_u$, the random variables $I_A(T^{k-1}\omega)$ are
independent and take the value 1 with probability $p_u = P(A)$. By the strong
law of large numbers, therefore,

(24.5)    $\lim_n \frac{1}{n} \sum_{k=1}^{n} I_A(T^{k-1}\omega) = P(A)$

with probability 1. ■


Example 24.1 gives a model for independent trials, and that T preserves P 
means the probability laws governing the trials are invariant in time. In the 
present section, it is this invariance of the probability laws that plays the 
fundamental role; independence is a side issue. 

The orbit under $T$ of the point $\omega$ is the sequence
$(\omega, T\omega, T^2\omega, \ldots)$, and (24.5) can be expressed by saying
that the orbit enters the set $A$ with asymptotic relative frequency $P(A)$.
For $A = [\omega\colon (z_1(\omega), z_2(\omega)) = (u_1, u_2)]$, the
$I_A(T^{k-1}\omega)$ are not independent, but (24.5) holds anyway. In fact, for
the Bernoulli shift, (24.5) holds with probability 1 whatever $A$ may be
($A \in \mathcal{F}$). This is one of the consequences of the ergodic theorem
(Theorem 24.1). What is more, according to this theorem the limit in (24.5)
exists with probability 1 (although it may not be constant in $\omega$) if $T$
is an arbitrary measure-preserving transformation on an arbitrary probability
space, of which there are many examples.
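
The claim that (24.5) holds for a two-coordinate event, where the indicators
are dependent, can be illustrated numerically. The following is only a sketch:
the alphabet, the probabilities, the seed, and the function name
`orbit_frequency` are assumptions made for the illustration, and a finite
simulation can suggest the limit but not prove it.

```python
import random

def orbit_frequency(omega, pattern):
    """Relative frequency with which the shifted points T^{k-1} omega fall in
    the thin cylinder [(z_1, ..., z_m) = pattern], over all shifts for which
    the finite sample `omega` still has m coordinates left."""
    m = len(pattern)
    n = len(omega) - m + 1
    hits = sum(1 for k in range(n) if tuple(omega[k:k + m]) == pattern)
    return hits / n

random.seed(0)                        # fixed seed so the run is reproducible
S = ["a", "b", "c"]
p = {"a": 0.5, "b": 0.3, "c": 0.2}    # hypothetical probabilities on S
omega = random.choices(S, weights=[p[s] for s in S], k=200_000)

pattern = ("a", "b")
freq = orbit_frequency(omega, pattern)
expected = p["a"] * p["b"]            # P(A) for A = [(z_1, z_2) = (a, b)]
print(freq, expected)
```

The two printed numbers should be close; the first depends on the seed.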


Example 24.2. The Markov shift. Let $P = [p_{ij}]$ be a stochastic matrix with
rows and columns indexed by the finite set $S$, and let $\pi_i$ be
probabilities on $S$. Replace (2.21) by
$P(A) = \sum_H \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}$. The argument in
Section 2 showing that product measure is consistently defined and finitely
additive carries over to this new measure: since the rows of the transition
matrix add to 1,

$\sum_{u_{n+1}} \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n} p_{u_n u_{n+1}}
    = \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n},$

and so the argument involving (2.23) goes through. The new measure is again
countably additive on $\mathcal{C}_0$ (Theorem 2.3) and so extends to
$\mathcal{C}$. This probability measure $P$ on $\mathcal{C}$ is uniquely
determined by the condition

(24.6)    $P[\omega\colon (z_1(\omega), \ldots, z_n(\omega)) = (u_1, \ldots, u_n)]
              = \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}.$

Thus the coordinate functions $z_n(\cdot)$ are a Markov chain with transition
probabilities $p_{ij}$ and initial probabilities $\pi_i$.

Suppose that the $\pi_i$ (until now unspecified) are stationary:
$\sum_i \pi_i p_{ij} = \pi_j$. Then

$\sum_{u \in S} \pi_u p_{u u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}
    = \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n},$

and it follows (see (24.4)) that $T$ preserves $P$. Under the measure $P$
specified by (24.6), $T$ is the Markov shift. ■
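
The role of stationarity can be checked exactly on a small chain. The sketch
below assumes a hypothetical two-state transition matrix and tests
$P(T^{-1}A) = P(A)$ on thin cylinders of low rank, once with stationary initial
probabilities and once with non-stationary ones. All names (`cylinder_prob`,
`shifted_prob`, `shift_preserves`) are introduced here for illustration.

```python
from fractions import Fraction as Fr
from itertools import product

# Hypothetical two-state chain; each row of p sums to 1.
p = [[Fr(1, 2), Fr(1, 2)],
     [Fr(1, 4), Fr(3, 4)]]
S = range(2)

def cylinder_prob(pi, u):
    """P[(z_1, ..., z_n) = u] as in (24.6), for initial probabilities pi."""
    prob = pi[u[0]]
    for i, j in zip(u, u[1:]):
        prob *= p[i][j]
    return prob

def shifted_prob(pi, u):
    """P(T^{-1}A) for the thin cylinder A = [(z_1, ..., z_n) = u]:
    sum over the values of the coordinate that the shift discards."""
    return sum(cylinder_prob(pi, (s,) + u) for s in S)

def shift_preserves(pi, max_rank=3):
    """Check P(T^{-1}A) = P(A) on all thin cylinders of rank <= max_rank."""
    return all(shifted_prob(pi, u) == cylinder_prob(pi, u)
               for n in range(1, max_rank + 1)
               for u in product(S, repeat=n))

stationary = [Fr(1, 3), Fr(2, 3)]     # satisfies sum_i pi_i p_ij = pi_j
uniform = [Fr(1, 2), Fr(1, 2)]        # not stationary for this p
print(shift_preserves(stationary), shift_preserves(uniform))
```

Exact rational arithmetic is used so that the equalities are tested without
rounding error; checking ranks up to 3 is of course only a spot check, the
general statement being the one proved in the text.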


The shift $T$, qua point transformation on $S^\infty$, is the same in Examples
24.1 and 24.2. A measure-preserving transformation, however, is the point
transformation together with the $\sigma$-field with respect to which it is
measurable and the measure it preserves.


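Example 24.3. Let $P$ be Lebesgue measure $\lambda$ on the unit interval, and
take $T\omega = 2\omega \pmod 1$:

$T\omega = \begin{cases} 2\omega & \text{if } 0 < \omega \le \tfrac12, \\
                         2\omega - 1 & \text{if } \tfrac12 < \omega \le 1. \end{cases}$

If $\omega$ has nonterminating dyadic expansion
$\omega = .d_1(\omega) d_2(\omega) \ldots$, then
$T\omega = .d_2(\omega) d_3(\omega) \ldots$: $T$ shifts the digits one place to
the left; compare (24.1). Since
$T^{-1}(0, x] = (0, \tfrac12 x] \cup (\tfrac12, \tfrac12 + \tfrac12 x]$, it
follows by Lemma 1 that $T$ preserves Lebesgue measure. This is the dyadic
transformation. ■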
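
The preimage formula used above can be verified exactly with rational
arithmetic. This is a minimal sketch; the function names are hypothetical, and
the convention that $T$ takes the value 1 at $\omega = \tfrac12$ and
$\omega = 1$ is the one that goes with the unit interval $(0, 1]$.

```python
from fractions import Fraction as Fr

def T(w):
    """Dyadic transformation on (0, 1]: 2w on (0, 1/2], 2w - 1 on (1/2, 1]."""
    return 2 * w if w <= Fr(1, 2) else 2 * w - 1

def preimage_of_interval(x):
    """T^{-1}(0, x] written as two half-open intervals (a, b]."""
    return [(Fr(0), x / 2), (Fr(1, 2), Fr(1, 2) + x / 2)]

def length(intervals):
    """Lebesgue measure of a disjoint union of intervals (a, b]."""
    return sum(b - a for a, b in intervals)

x = Fr(3, 8)
print(length(preimage_of_interval(x)) == x)   # measure of T^{-1}(0, x] equals x
```

Since the intervals $(0, x]$ form a $\pi$-system generating the Borel sets of
the unit interval, this equality is exactly the hypothesis of Lemma 1.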


Example 24.4. Let $\Omega$ be the unit circle in the complex plane, let
$\mathcal{F}$ be the $\sigma$-field generated by the arcs, and let $P$ be
normalized circular Lebesgue measure: map $[0, 1)$ to the unit circle by
$\phi(x) = e^{2\pi i x}$ and define $P$ by $P(A) = \lambda(\phi^{-1}A)$. For a
fixed $c$ in $\Omega$, let $T\omega = c\omega$. Since $T$ is effectively the
rotation of the circle through the angle $\arg c$, $T$ preserves $P$. The
rotation turns out to have radically different properties according as $c$ is
a root of unity or not. ■


Ergodicity 


The $\mathcal{F}$-set $A$ is invariant under $T$ if $T^{-1}A = A$; it is a
nontrivial invariant set if $0 < P(A) < 1$. And $T$ is by definition ergodic
if there are in $\mathcal{F}$ no nontrivial invariant sets. A measurable
function $f$ is invariant if $f(T\omega) = f(\omega)$ for all $\omega$; $A$ is
invariant if and only if $I_A$ is.

The ergodic theorem:


Theorem 24.1. Suppose that $T$ is a measure-preserving transformation on
$(\Omega, \mathcal{F}, P)$ and that $f$ is measurable and integrable. Then

(24.7)    $\lim_n \frac{1}{n} \sum_{k=1}^{n} f(T^{k-1}\omega) = \hat f(\omega)$

with probability 1, where $\hat f$ is invariant and integrable and
$E[\hat f] = E[f]$. If $T$ is ergodic, then $\hat f = E[f]$ with probability 1.


This will be proved later in the section. In Section 34, $\hat f$ will be
identified as a conditional expected value (see Example 34.3).
If $f = I_A$, (24.7) becomes

(24.8)    $\lim_n \frac{1}{n} \sum_{k=1}^{n} I_A(T^{k-1}\omega) = \hat I_A(\omega),$

and in the ergodic case,

(24.9)    $\hat I_A(\omega) = P(A)$

with probability 1. If $A$ is invariant, then $\hat I_A(\omega)$ is 1 on $A$
and 0 on $A^c$, and so the limit can certainly be nonconstant if $T$ is not
ergodic.


Example 24.5. Take $\Omega = \{a, b, c, d, e\}$ and $\mathcal{F} = 2^\Omega$.
If $T$ is the cyclic permutation $T = (a, b, c, d, e)$ and $T$ preserves $P$,
then $P$ assigns equal probabilities to the five points. Since $\emptyset$ and
$\Omega$ are the only invariant sets, $T$ is ergodic. It is easy to check
(24.8) and (24.9) directly.

The transformation $T = (a, b, c)(d, e)$, a product of two cycles, preserves
$P$ if and only if $a, b, c$ have equal probabilities and $d, e$ have equal
probabilities. If the probabilities are all positive, then since $\{a, b, c\}$
is invariant, $T$ is not ergodic. If, say, $A = \{a, d\}$, the limit in (24.8)
is $\tfrac13$ on $\{a, b, c\}$ and $\tfrac12$ on $\{d, e\}$. This illustrates
the essential role of ergodicity. ■
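
The orbit averages in this example can be computed exactly. The sketch below
encodes each permutation as a dictionary (an encoding chosen here for
convenience) and evaluates the average in (24.8) over a number of steps that is
a common multiple of the cycle lengths, so that the finite average already
equals the limit.

```python
from fractions import Fraction as Fr

def orbit_average(T, A, w, n):
    """(1/n) * sum_{k=1}^{n} I_A(T^{k-1} w), computed exactly."""
    hits = 0
    for _ in range(n):
        hits += w in A      # a bool counts as 0 or 1
        w = T[w]
    return Fr(hits, n)

cyclic = {"a": "b", "b": "c", "c": "d", "d": "e", "e": "a"}      # (a,b,c,d,e)
two_cycles = {"a": "b", "b": "c", "c": "a", "d": "e", "e": "d"}  # (a,b,c)(d,e)
A = {"a", "d"}
n = 30   # a common multiple of the cycle lengths 5, 3, and 2

for name, T in (("cyclic", cyclic), ("two cycles", two_cycles)):
    print(name, {w: str(orbit_average(T, A, w, n)) for w in sorted(T)})
```

For the cyclic permutation the average is the same at every starting point;
for the product of two cycles it depends on which cycle contains the starting
point, which is the failure of ergodicity described above.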


The coordinate functions $z_n(\cdot)$ in Example 24.1 are independent, and
hence by Kolmogorov's zero-one law every set in the tail field
$\mathcal{T} = \bigcap_n \sigma(z_n, z_{n+1}, \ldots)$ has probability 0 or 1.
(That the $z_n$ take values in the abstract set $S$ does not affect the
arguments.) If $A \in \sigma(z_1, \ldots, z_k)$, then
$T^{-n}A \in \sigma(z_{n+1}, \ldots, z_{n+k}) \subset \sigma(z_{n+1}, z_{n+2}, \ldots)$;
since this is true for each $k$,
$A \in \mathcal{F} = \sigma(z_1, z_2, \ldots)$ implies (Theorem 13.1(i))
$T^{-n}A \in \sigma(z_{n+1}, z_{n+2}, \ldots)$. For $A$ invariant, it follows
that $A \in \mathcal{T}$: The Bernoulli shift is ergodic. Thus the ergodic
theorem does imply that (24.5) holds with probability 1, whatever $A$ may be.

The ergodicity of the Bernoulli shift can be proved in a different way. If
$A = [(z_1, \ldots, z_n) = u]$ and $B = [(z_1, \ldots, z_k) = v]$ for an
$n$-tuple $u$ and a $k$-tuple $v$, and if $P$ is given by (24.3), then
$P(A \cap T^{-n}B) = P(A)P(B)$ because
$T^{-n}B = [(z_{n+1}, \ldots, z_{n+k}) = v]$. Fix $n$ and $A$, and use the
$\pi$-$\lambda$ theorem to show that this holds for all $B$ in $\mathcal{F}$.
If $B$ is invariant, then $P(A \cap B) = P(A)P(B)$ holds for the $A$ above, and
hence ($\pi$-$\lambda$ again) holds for all $A$. Taking $A = B$ shows that
$P(B) = (P(B))^2$ for invariant $B$, and $P(B)$ is 0 or 1. This argument is
very close to the proof of the zero-one law, but a modification of it gives a
criterion for ergodicity that applies to the Markov shift and other
transformations.


Lemma 2. Suppose that $\mathcal{P} \subset \mathcal{F}_0 \subset \mathcal{F}$,
where $\mathcal{F}_0$ is a field, every set in $\mathcal{F}_0$ is a finite or
countable disjoint union of $\mathcal{P}$-sets, and $\mathcal{F}_0$ generates
$\mathcal{F}$. Suppose further that there exists a positive $c$ with this
property: For each $A$ in $\mathcal{P}$ there is an integer $n_A$ such that

(24.10)    $P(A \cap T^{-n_A}B) \ge cP(A)P(B)$

for all $B$ in $\mathcal{P}$. Then $T^{-1}C = C$ implies that $P(C)$ is 0 or 1.


It is convenient not to require that $T$ preserve $P$; but if it does, then it
is an ergodic measure-preserving transformation. In the argument just given,
$\mathcal{P}$ consists of the thin cylinders, $n_A = n$ if
$A = [(z_1, \ldots, z_n) = u]$, $c = 1$, and $\mathcal{F}_0 = \mathcal{C}_0$
is the class of cylinders.


Proof. Since every $\mathcal{F}_0$-set is a disjoint union of
$\mathcal{P}$-sets, (24.10) holds for $B \in \mathcal{F}_0$ (and
$A \in \mathcal{P}$). Since for fixed $A$ the class of $B$ satisfying (24.10)
is monotone, it contains $\mathcal{F}$ (Theorem 3.4). If $B$ is invariant, it
follows that $P(A \cap B) \ge cP(A)P(B)$ for $A$ in $\mathcal{P}$. But then,
by the same argument, the inequality holds for all $A$ in $\mathcal{F}$. Take
$A = B^c$: If $B$ is invariant, then $P(B^c)P(B) = 0$ and hence $P(B)$ is 0
or 1. ■


To treat the Markov shift, take $\mathcal{F}_0$ to consist of the cylinders
and $\mathcal{P}$ to consist of the thin ones. If
$A = [(z_1, \ldots, z_n) = (u_1, \ldots, u_n)]$, $n_A = n + m - 1$, and
$B = [(z_1, \ldots, z_k) = (v_1, \ldots, v_k)]$, then

(24.11)    $P(A)P(B) = \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}
                        \pi_{v_1} p_{v_1 v_2} \cdots p_{v_{k-1} v_k},$
           $P(A \cap T^{-n_A}B) = \pi_{u_1} p_{u_1 u_2} \cdots p_{u_{n-1} u_n}
                        p^{(m)}_{u_n v_1} p_{v_1 v_2} \cdots p_{v_{k-1} v_k}.$

The lemma will apply if there exist an integer $m$ and a positive $c$ such that
$p^{(m)}_{ij} \ge c\pi_j$ for all $i$ and $j$. By Theorem 8.9 (or Lemma 2,
p. 125), there is in the irreducible, aperiodic case an $m$ such that all
$p^{(m)}_{ij}$ are positive; take $c$ less than the minimum. By Lemma 2, the
corresponding Markov shift is ergodic.


Example 24.6. Maps preserve ergodicity. Suppose that
$\psi\colon \Omega \to \Omega$ is measurable $\mathcal{F}/\mathcal{F}$ and
commutes with $T$ in the sense that $\psi T\omega = T\psi\omega$. If $T$
preserves $P$, it also preserves $P\psi^{-1}$:
$P\psi^{-1}(T^{-1}A) = P(T^{-1}\psi^{-1}A) = P\psi^{-1}(A)$. And if $T$ is
ergodic under $P$, it is also ergodic under $P\psi^{-1}$: if $A$ is invariant,
so is $\psi^{-1}A$, and hence $P\psi^{-1}(A)$ is 0 or 1. These simple
observations are useful in studying the ergodicity of stochastic processes
(Theorem 36.4). ■


Ergodicity of Rotations 


The dyadic transformation, Example 24.3, is essentially the same as the Bernoulli 
shift. In any case, it is easy to use the zero—one law or Lemma 2 to show that it is 
ergodic. From this and the ergodic theorem, the normal number theorem follows once 
again. 

Consider the rotations of Example 24.4. If the complex number $c$ defining the
rotation ($T\omega = c\omega$) is $-1$, then the set consisting of the first
and third quadrants is a nontrivial invariant set, and hence $T$ is not
ergodic. A similar construction shows that $T$ is nonergodic whenever $c$ is a
root of unity.

In the opposite case, $T$ is ergodic. In the first place, it is an old
number-theoretic fact due to Kronecker that if $c$ is not a root of unity then
the orbit $(\omega, c\omega, c^2\omega, \ldots)$ of every $\omega$ is dense.
Since the orbits are rotations of one another, it suffices to prove that the
orbit $(1, c, c^2, \ldots)$ of 1 is dense. But if $c$ is not a root of unity,
then the elements of this orbit are all distinct and hence by compactness have
a limit point $\omega_0$. For arbitrary $\epsilon$, there are distinct points
$c^n$ and $c^{n+k}$ within $\epsilon/2$ of $\omega_0$ and hence within
$\epsilon$ of each other (distance measured along the arc). But then, since
the distance from $c^{n+jk}$ to $c^{n+(j+1)k}$ is the same as that from $c^n$
to $c^{n+k}$, it is clear that for some $m$ the points
$c^n, c^{n+k}, \ldots, c^{n+mk}$ form a chain which extends all the way around
the circle and in which the distance from one point to the next is less than
$\epsilon$. Thus every point on the circle is within $\epsilon$ of some point
of the orbit $(1, c, c^2, \ldots)$, which is indeed dense.

To use this result to prove ergodicity, suppose that $A$ is invariant and
$P(A) > 0$. To show that $P(A)$ must then be 1, observe first that for
arbitrary $\epsilon$ there is an arc $I$, of length at most $\epsilon$,
satisfying $P(A \cap I) > (1 - \epsilon)P(I)$. Indeed, $A$ can be covered by a
sequence $I_1, I_2, \ldots$ of arcs for which
$P(A)/(1 - \epsilon) > \sum_n P(I_n)$; the arcs can be taken disjoint and of
length less than $\epsilon$. Since
$\sum_n P(A \cap I_n) = P(A) > (1 - \epsilon)\sum_n P(I_n)$, there is an $n$
for which $P(A \cap I_n) > (1 - \epsilon)P(I_n)$: take $I = I_n$. Let $I$ have
length $\alpha$; $\alpha \le \epsilon$.

Since $A$ is invariant and $T$ is invertible and preserves $P$, it follows
that $P(A \cap T^n I) > (1 - \epsilon)P(T^n I)$. Suppose the arc $I$ runs from
$a$ to $b$. Let $n_1$ be arbitrary and, using the fact that $\{T^n a\}$ is
dense, choose $n_2$ so that $T^{n_1}I$ and $T^{n_2}I$ are disjoint and the
distance from $T^{n_1}b$ to $T^{n_2}a$ is less than $\epsilon\alpha$. Then
choose $n_3$ so that $T^{n_1}I$, $T^{n_2}I$, $T^{n_3}I$ are disjoint and the
distance from $T^{n_2}b$ to $T^{n_3}a$ is less than $\epsilon\alpha$. Continue
until $T^{n_k}b$ is within $\alpha$ of $T^{n_1}a$ and a further step is
impossible. Since the $T^{n_i}I$ are disjoint, $k\alpha \le 1$; and by the
construction, the $T^{n_i}I$ cover the circle to within a set of measure
$k\epsilon\alpha + \alpha$, which is at most $2\epsilon$. And now by
disjointness,

$P(A) \ge \sum_{i=1}^{k} P(A \cap T^{n_i}I)
      \ge (1 - \epsilon) \sum_{i=1}^{k} P(T^{n_i}I)
      \ge (1 - \epsilon)(1 - 2\epsilon).$

Since $\epsilon$ was arbitrary, $P(A)$ must be 1: $T$ is ergodic if $c$ is not
a root of unity.†


†For a simple Fourier-series proof, see Problem 26.30.
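
The contrast between the two kinds of rotation shows up in the frequency with
which the orbit of 1 visits a fixed arc. The following sketch measures angles
in turns, so that $c = e^{2\pi i\theta}$ corresponds to adding $\theta$ modulo
1. The particular $\theta$, the arc, and the orbit length are arbitrary
choices for the illustration, and floating-point arithmetic makes this a
numerical experiment rather than a proof.

```python
import math

def arc_frequency(theta, start, end, n):
    """Fraction of k in 0..n-1 for which the angle k*theta (mod 1)
    lies in [start, end); angles are measured in turns."""
    hits = sum(1 for k in range(n) if start <= (k * theta) % 1.0 < end)
    return hits / n

n = 100_000
# theta irrational: c = exp(2*pi*i*theta) is not a root of unity.
freq_irrational = arc_frequency(math.sqrt(2) - 1, 0.05, 0.20, n)
# theta = 1/4: c = i is a fourth root of unity, and the orbit has four points.
freq_rational = arc_frequency(0.25, 0.05, 0.20, n)
print(freq_irrational, freq_rational)
```

In the irrational case the frequency should be close to the arc length 0.15,
as the ergodic theorem predicts for almost every starting point; in the
rational case the orbit never enters the arc at all.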




Proof of the Ergodic Theorem 


The argument depends on a preliminary result the statement and proof of
which are most clearly expressed in terms of functional operators. For a real
function $f$ on $\Omega$, let $Uf$ be the real function with value
$(Uf)(\omega) = f(T\omega)$ at $\omega$. If $f$ is integrable, then by change
of variable (Theorem 16.13),

(24.12)    $E[Uf] = \int_\Omega f(T\omega)\,P(d\omega)
                  = \int_\Omega f(\omega)\,PT^{-1}(d\omega) = E[f].$

And the operator $U$ is nonnegative in the sense that it carries nonnegative
functions to nonnegative functions; hence $f \le g$ (pointwise) implies
$Uf \le Ug$.

Make these pointwise definitions: $S_0 f = 0$,
$S_n f = f + Uf + \cdots + U^{n-1}f$, $M_n f = \max_{0 \le k \le n} S_k f$, and
$M_\infty f = \sup_{n \ge 0} S_n f = \sup_{n \ge 0} M_n f$. The maximal
ergodic theorem:


Theorem 24.2. If $f$ is integrable, then

(24.13)    $\int_{[M_\infty f > 0]} f\,dP \ge 0.$


Proof. Since $B_n = [M_n f > 0] \uparrow [M_\infty f > 0]$, it is enough, by
the dominated convergence theorem, to show that $\int_{B_n} f\,dP \ge 0$. On
$B_n$, $M_n f = \max_{1 \le k \le n} S_k f$. Since the operator $U$ is
nonnegative, $S_k f = f + US_{k-1}f \le f + UM_n f$ for $1 \le k \le n$, and
therefore $M_n f \le f + UM_n f$ on $B_n$. This and the fact that the function
$UM_n f$ is nonnegative imply

$\int_\Omega M_n f\,dP = \int_{B_n} M_n f\,dP \le \int_{B_n} (f + UM_n f)\,dP$

$\qquad \le \int_{B_n} f\,dP + \int_\Omega UM_n f\,dP
         = \int_{B_n} f\,dP + \int_\Omega M_n f\,dP,$

where the last equality follows from (24.12). Hence $\int_{B_n} f\,dP \ge 0$. ■
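
The inequality $\int_{B_n} f\,dP \ge 0$ obtained in the proof can be checked
exactly on a finite measure-preserving system. In the sketch below, $T$ is a
permutation given as a dictionary, $P$ is constant on each cycle (so that $T$
preserves it), and $f$ is an arbitrary function; all of these, and the name
`maximal_integral`, are assumptions made for the illustration.

```python
from fractions import Fraction as Fr

def maximal_integral(T, P, f, n):
    """Integral of f over B_n = [M_n f > 0], where
    M_n f = max_{0<=k<=n} S_k f and S_k f = f + Uf + ... + U^{k-1} f."""
    total = Fr(0)
    for w in T:
        s, m, x = Fr(0), Fr(0), w        # S_0 f = 0, so M_n f >= 0
        for _ in range(n):
            s += f[x]                    # S_k f(w) = S_{k-1} f(w) + f(T^{k-1} w)
            m = max(m, s)
            x = T[x]
        if m > 0:                        # w belongs to B_n
            total += f[w] * P[w]
    return total

T = {0: 1, 1: 2, 2: 0}                   # a 3-cycle
P = {w: Fr(1, 3) for w in T}             # uniform, hence preserved by T
f = {0: Fr(-5), 1: Fr(1), 2: Fr(1)}      # E[f] < 0, yet the integral over B_n is >= 0
print(maximal_integral(T, P, f, 3))
```

Here the point 0 is excluded from $B_3$ because every partial sum along its
orbit is negative, and the integral over the remaining two points is
$\tfrac23$.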
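Replace $f$ by $fI_A$. If $A$ is invariant, then $S_n(fI_A) = (S_n f)I_A$, and
$M_\infty(fI_A) = (M_\infty f)I_A$, and therefore (24.13) gives

(24.14)    $\int_{A \cap [M_\infty f > 0]} f\,dP \ge 0$  if $T^{-1}A = A$.

Now replace $f$ here by $f - \lambda$, $\lambda$ a constant. Clearly
$[M_\infty(f - \lambda) > 0]$ is the set where for some $n \ge 1$,
$S_n(f - \lambda) > 0$, or $n^{-1}S_n f > \lambda$. Let

(24.15)    $F_\lambda = \left[\omega\colon \sup_{n \ge 1} \frac{1}{n}
                        \sum_{k=1}^{n} f(T^{k-1}\omega) > \lambda\right];$

it follows by (24.14) that $\int_{A \cap F_\lambda}(f - \lambda)\,dP \ge 0$, or

(24.16)    $\lambda P(A \cap F_\lambda) \le \int_{A \cap F_\lambda} f\,dP$
           if $T^{-1}A = A$.

The $\lambda$ here can have either sign.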


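Proof of Theorem 24.1. To prove that the averages
$a_n(\omega) = n^{-1}\sum_{k=1}^{n} f(T^{k-1}\omega)$ converge, consider the set

$A_{\alpha,\beta} = \left[\omega\colon \liminf_n a_n(\omega) < \alpha < \beta
                     < \limsup_n a_n(\omega)\right]$

for $\alpha < \beta$. Since $A_{\alpha,\beta} = A_{\alpha,\beta} \cap F_\beta$
and $A_{\alpha,\beta}$ is invariant, (24.16) gives
$\beta P(A_{\alpha,\beta}) \le \int_{A_{\alpha,\beta}} f\,dP$. The same result
with $-f$, $-\beta$, $-\alpha$ in place of $f$, $\alpha$, $\beta$ is
$\int_{A_{\alpha,\beta}} f\,dP \le \alpha P(A_{\alpha,\beta})$. Since
$\alpha < \beta$, the two inequalities together lead to
$P(A_{\alpha,\beta}) = 0$. Take the union over rational $\alpha$ and $\beta$:
The averages $a_n(\omega)$ converge, that is,

(24.17)    $\lim_n a_n(\omega) = \hat f(\omega)$

with probability 1, where $\hat f$ may take the values $\pm\infty$ at certain
values of $\omega$.
Because of (24.12), $E[a_n] = E[f]$; if it is shown that the $a_n(\omega)$ are
uniformly integrable, then it will follow (Theorem 16.14) that $\hat f$ is
integrable and $E[\hat f] = E[f]$.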
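By (24.16), $\lambda P(F_\lambda) \le E[|f|]$. Combine this with the same
inequality for $-f$: If
$G_\lambda = [\omega\colon \sup_n |a_n(\omega)| > \lambda]$, then
$\lambda P(G_\lambda) \le 2E[|f|]$ (trivial if $\lambda \le 0$). Therefore,
for positive $\alpha$ and $\lambda$,

$\int_{[|a_n| > \lambda]} |a_n|\,dP
   \le \frac{1}{n} \sum_{k=1}^{n} \int_{G_\lambda} |f(T^{k-1}\omega)|\,P(d\omega)$

$\qquad \le \frac{1}{n} \sum_{k=1}^{n} \left( \int_{[|f(T^{k-1}\omega)| > \alpha]}
        |f(T^{k-1}\omega)|\,P(d\omega) + \alpha P(G_\lambda) \right)$

$\qquad = \int_{[|f(\omega)| > \alpha]} |f(\omega)|\,P(d\omega) + \alpha P(G_\lambda)$

$\qquad \le \int_{[|f(\omega)| > \alpha]} |f(\omega)|\,P(d\omega)
        + 2\frac{\alpha}{\lambda} E[|f|].$

Take $\alpha = \lambda^{1/2}$; since $f$ is integrable, the final expression
here goes to 0 as $\lambda \to \infty$. The $a_n(\omega)$ are therefore
uniformly integrable, and $E[\hat f] = E[f]$. The uniform integrability also
implies $E[|a_n - \hat f|] \to 0$.

Set $\hat f(\omega) = 0$ outside the set where the $a_n(\omega)$ have a finite
limit. Then (24.17) still holds with probability 1, and
$\hat f(T\omega) = \hat f(\omega)$. Since
$[\omega\colon \hat f(\omega) \le x]$ is invariant, in the ergodic case its
measure is either 0 or 1; if $x_0$ is the infimum of the $x$ for which it is
1, then $\hat f(\omega) = x_0$ with probability 1, and from
$x_0 = E[\hat f] = E[f]$ it follows that $\hat f(\omega) = E[f]$ with
probability 1. ■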


The Continued-Fraction Transformation 


Let $\Omega$ consist of the irrationals in the unit interval, and for $x$ in
$\Omega$ let $Tx = \{1/x\}$ and $a_1(x) = \lfloor 1/x \rfloor$ be the
fractional and integral parts of $1/x$. This defines a mapping

(24.18)    $Tx = \left\{\frac{1}{x}\right\}
               = \frac{1}{x} - \left\lfloor\frac{1}{x}\right\rfloor
               = \frac{1}{x} - a_1(x)$

of $\Omega$ into itself, a mapping associated with the continued-fraction
expansion of $x$ [A36]. Concentrating on irrational $x$ avoids some trivial
details connected with the rational case, where the expansion is finite; it is
an inessential restriction because the interest here centers on results of the
almost-everywhere kind.

For $x \in \Omega$ and $n \ge 1$ let $a_n(x) = a_1(T^{n-1}x)$ be the $n$th
partial quotient, and define integer-valued functions $p_n(x)$ and $q_n(x)$ by
the recursions

(24.19)
$p_{-1}(x) = 1, \quad p_0(x) = 0, \quad
 p_n(x) = a_n(x) p_{n-1}(x) + p_{n-2}(x), \quad n \ge 1,$
$q_{-1}(x) = 0, \quad q_0(x) = 1, \quad
 q_n(x) = a_n(x) q_{n-1}(x) + q_{n-2}(x), \quad n \ge 1.$

Simple induction arguments show that

(24.20)    $p_{n-1}(x) q_n(x) - p_n(x) q_{n-1}(x) = (-1)^n, \quad n \ge 0,$

and [A37: (27)]

(24.21)    $x = \cfrac{1}{a_1(x) + \cfrac{1}{\ddots + \cfrac{1}{a_{n-1}(x)
               + \cfrac{1}{a_n(x) + T^n x}}}}, \quad n \ge 1.$

It also follows inductively [A36: (26)] that

(24.22)    $\cfrac{1}{a_1(x) + \cfrac{1}{\ddots + \cfrac{1}{a_{n-1}(x)
               + \cfrac{1}{a_n(x) + t}}}}
             = \frac{p_n(x) + t p_{n-1}(x)}{q_n(x) + t q_{n-1}(x)},
             \quad n \ge 1,\ 0 \le t \le 1.$

Taking $t = 0$ here gives the formula for the $n$th convergent:

(24.23)    $\cfrac{1}{a_1(x) + \cfrac{1}{\ddots + \cfrac{1}{a_n(x)}}}
             = \frac{p_n(x)}{q_n(x)}, \quad n \ge 1,$

where, as follows from (24.20), $p_n(x)$ and $q_n(x)$ are relatively prime. By
(24.21) and (24.22),

(24.24)    $x = \frac{p_n(x) + (T^n x) p_{n-1}(x)}{q_n(x) + (T^n x) q_{n-1}(x)},
             \quad n \ge 0,$

which, together with (24.20), implies*

(24.25)    $x - \frac{p_n(x)}{q_n(x)}
             = \frac{(-1)^n}{q_n(x)\left((T^n x)^{-1} q_n(x) + q_{n-1}(x)\right)},
             \quad n \ge 0.$

Thus the convergents for even $n$ fall to the left of $x$, and those for odd
$n$ fall to the right. And since (24.19) obviously implies that $q_n(x)$ goes
to infinity with $n$, the convergents $p_n(x)/q_n(x)$ do converge to $x$: Each
irrational $x$ in $(0, 1)$ has the infinite simple continued-fraction
representation

(24.26)    $x = \cfrac{1}{a_1(x) + \cfrac{1}{a_2(x) + \cdots}}.$

The representation is unique [A36: (35)], and
$Tx = \cfrac{1}{a_2(x) + \cfrac{1}{a_3(x) + \cdots}}$: $T$ shifts the partial
quotients in the same way the dyadic transformation (Example 24.3) shifts the
digits of the dyadic expansion. Since the continued-fraction transformation
turns out to be ergodic, it can be used to study the continued-fraction
algorithm.
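
The recursions (24.19) and the identity (24.20) are easy to exercise in exact
arithmetic. The sketch below is an illustration only: it works with a rational
$x$, whose expansion is finite, whereas the text restricts attention to
irrational $x$; the function names are hypothetical.

```python
from fractions import Fraction as Fr

def partial_quotients(x, n):
    """First n partial quotients of a rational x in (0, 1), found by iterating
    Tx = {1/x}; stops early if the (finite) expansion terminates."""
    out = []
    while x != 0 and len(out) < n:
        inv = 1 / x
        a = inv.numerator // inv.denominator   # integral part of 1/x
        out.append(a)
        x = inv - a                            # fractional part of 1/x
    return out

def convergents(a):
    """Yield (p_n, q_n) for n = 0, 1, ..., len(a) from the recursions (24.19),
    where a = [a_1, ..., a_N] lists the partial quotients."""
    p_prev, p = 1, 0      # p_{-1}, p_0
    q_prev, q = 0, 1      # q_{-1}, q_0
    yield p, q
    for a_n in a:
        p_prev, p = p, a_n * p + p_prev
        q_prev, q = q, a_n * q + q_prev
        yield p, q

a = partial_quotients(Fr(113, 355), 10)
pq = list(convergents(a))
print(a, pq)
# (24.20): p_{n-1} q_n - p_n q_{n-1} = (-1)^n
print(all(pq[n - 1][0] * pq[n][1] - pq[n][0] * pq[n - 1][1] == (-1) ** n
          for n in range(1, len(pq))))
```

For $x = 113/355$ the partial quotients are 3, 7, 16, and the last convergent
reproduces $x$ itself, in lowest terms as (24.20) requires.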

Suppose now that $a_1, a_2, \ldots$ are positive integers and define $p_n$ and
$q_n$ by the recursions (24.19) without the argument $x$. Then (24.20) again
holds (without the $x$), and so
$p_n/q_n - p_{n-1}/q_{n-1} = (-1)^{n+1}/(q_{n-1}q_n)$, $n \ge 1$. Since $q_n$
increases to infinity, the right side here is the $n$th term of a convergent
alternating series. And since $p_0/q_0 = 0$, the $n$th partial sum is
$p_n/q_n$, which therefore converges to some limit: Every simple infinite
continued fraction converges, and [A36: (36)] the limit is an irrational in
$(0, 1)$.

Let $\Delta_{a_1 \ldots a_n}$ be the set of $x$ in $\Omega$ such that
$a_k(x) = a_k$ for $1 \le k \le n$; call it a fundamental set of rank $n$.
These sets are analogous to the dyadic intervals and the thin cylinders. For
an explicit description of $\Delta_{a_1 \ldots a_n}$, which is necessary for
the proof of ergodicity, consider the function

(24.27)    $\psi_{a_1 \ldots a_n}(t) = \cfrac{1}{a_1 + \cfrac{1}{\ddots
               + \cfrac{1}{a_{n-1} + \cfrac{1}{a_n + t}}}}.$


*Theorem 1.4 follows from this. 




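If $x \in \Delta_{a_1 \ldots a_n}$, then $x = \psi_{a_1 \ldots a_n}(T^n x)$ by
(24.21); on the other hand, because of the uniqueness of the partial quotients
[A36: (33)], if $t$ is an irrational in the unit interval, then (24.27) lies
in $\Delta_{a_1 \ldots a_n}$. Thus $\Delta_{a_1 \ldots a_n}$ is the image
under (24.27) of $\Omega$ itself.

Just as (24.22) follows by induction, so does

(24.28)    $\psi_{a_1 \ldots a_n}(t) = \frac{p_n + t p_{n-1}}{q_n + t q_{n-1}}.$

And $\psi_{a_1 \ldots a_n}(t)$ is increasing or decreasing in $t$ according as
$n$ is even or odd, as is clear from the form of (24.27) (or differentiate in
(24.28) and use (24.20)). It follows that

$\Delta_{a_1 \ldots a_n} = \begin{cases}
   \left(\dfrac{p_n}{q_n}, \dfrac{p_n + p_{n-1}}{q_n + q_{n-1}}\right) \cap \Omega
      & \text{if $n$ is even}, \\[2ex]
   \left(\dfrac{p_n + p_{n-1}}{q_n + q_{n-1}}, \dfrac{p_n}{q_n}\right) \cap \Omega
      & \text{if $n$ is odd}.
 \end{cases}$

By (24.20), this set has Lebesgue measure

(24.29)    $\lambda(\Delta_{a_1 \ldots a_n}) = \frac{1}{q_n(q_n + q_{n-1})}.$

The fundamental sets of rank $n$ form a partition of $\Omega$, and their
unions form a field $\mathcal{F}_n$; let
$\mathcal{F}_0 = \bigcup_{n=1}^{\infty} \mathcal{F}_n$. Then $\mathcal{F}_0$
is the field generated by the class $\mathcal{P}$ of all the fundamental sets,
and since each set in $\mathcal{F}_0$ is in some $\mathcal{F}_n$, each is a
finite or countable disjoint union of $\mathcal{P}$-sets. Since
$q_n \ge 2q_{n-2}$ by (24.19), induction gives $q_n \ge 2^{(n-1)/2}$ for
$n \ge 0$. And now (24.29) implies that $\mathcal{F}_0$ generates the
$\sigma$-field $\mathcal{F}$ of linear Borel sets that are subsets of $\Omega$
(use Theorem 10.1(ii)). Thus $\mathcal{P}$, $\mathcal{F}_0$, $\mathcal{F}$ are
related as in the hypothesis of Lemma 2. Clearly $T$ is measurable
$\mathcal{F}/\mathcal{F}$.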


Although $T$ does not preserve $\lambda$, it does preserve Gauss's measure,
defined by

(24.30)    $P(A) = \frac{1}{\log 2} \int_A \frac{dx}{1 + x},
             \quad A \in \mathcal{F}.$

In fact, since

$T^{-1}((0, t) \cap \Omega)
   = \bigcup_{k=1}^{\infty} \left(\left(\frac{1}{k + t}, \frac{1}{k}\right)
     \cap \Omega\right),$

it is enough to verify

$\int_0^t \frac{dx}{1 + x}
   = \sum_{k=1}^{\infty} \int_{1/(k+t)}^{1/k} \frac{dx}{1 + x}.$

Gauss's measure is useful because it is preserved by $T$ and has the same sets
of measure 0 as Lebesgue measure does.
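
The identity to be verified can also be checked numerically, alongside the
corresponding computation for Lebesgue measure, which fails. This is a sketch
under the stated truncation: the infinite union is cut off after `terms`
intervals (a hypothetical parameter), which leaves an error of order
`1/terms`.

```python
import math

def gauss(a, b):
    """Gauss's measure (24.30) of the interval (a, b) within (0, 1)."""
    return (math.log1p(b) - math.log1p(a)) / math.log(2)

def gauss_of_preimage(t, terms=200_000):
    """Gauss's measure of T^{-1}(0, t), summed over the first `terms`
    intervals (1/(k + t), 1/k)."""
    return math.fsum(gauss(1 / (k + t), 1 / k) for k in range(1, terms + 1))

def lebesgue_of_preimage(t, terms=200_000):
    """Lebesgue measure of the same truncated union of intervals."""
    return math.fsum(1 / k - 1 / (k + t) for k in range(1, terms + 1))

t = 0.5
print(gauss_of_preimage(t), gauss(0.0, t))   # nearly equal: P is preserved
print(lebesgue_of_preimage(t), t)            # visibly different: lambda is not
```

The partial sums telescope, since the $k$th summand in the Gauss case is
$\log_2\bigl((k+1)(k+t)/(k(k+t+1))\bigr)$, which is one way to carry out the
verification by hand.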

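Proof that $T$ is ergodic. Fix $a_1, \ldots, a_n$, and write $\psi_n$ for
$\psi_{a_1 \ldots a_n}$ and $\Delta_n$ for $\Delta_{a_1 \ldots a_n}$. Suppose
that $n$ is even, so that $\psi_n$ is increasing. If $x \in \Delta_n$, then
(since $x = \psi_n(T^n x)$) $s < T^n x < t$ if and only if
$\psi_n(s) < x < \psi_n(t)$; and this last condition implies
$x \in \Delta_n$. Combined with (24.28) and (24.20), this shows that

$\lambda(\Delta_n \cap [x\colon s < T^n x < t]) = \psi_n(t) - \psi_n(s)
   = \frac{t - s}{(q_n + sq_{n-1})(q_n + tq_{n-1})}.$

If $B$ is an interval with endpoints $s$ and $t$, then by (24.29),

$\lambda(\Delta_n \cap T^{-n}B) = \lambda(\Delta_n)\lambda(B)
   \frac{q_n(q_n + q_{n-1})}{(q_n + sq_{n-1})(q_n + tq_{n-1})}.$

A similar argument establishes this for $n$ odd. Since the ratio on the right
lies between $\tfrac12$ and 2,

(24.31)    $\tfrac12 \lambda(\Delta_n)\lambda(B)
             \le \lambda(\Delta_n \cap T^{-n}B)
             \le 2\lambda(\Delta_n)\lambda(B).$

Therefore, (24.10) holds for $\mathcal{P}$, $\mathcal{F}_0$, $\mathcal{F}$ as
defined above, $A = \Delta_n$, $n_A = n$, $c = \tfrac12$, and $\lambda$ in the
role of $P$. Thus $T^{-1}C = C$ implies that $\lambda(C)$ is 0 or 1, and since
Gauss's measure (24.30) comes from a density, $P(C)$ is 0 or 1 as well.
Therefore, $T$ is an ergodic measure-preserving transformation on
$(\Omega, \mathcal{F}, P)$.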


It follows by the ergodic theorem that if $f$ is integrable, then

(24.32)    $\lim_n \frac{1}{n} \sum_{k=1}^{n} f(T^{k-1}x)
             = \frac{1}{\log 2} \int_0^1 \frac{f(x)}{1 + x}\,dx$

holds almost everywhere. Since the density in (24.30) is bounded away from 0
and $\infty$, the "integrable" and "almost everywhere" here can refer to $P$
or to $\lambda$ indifferently.

Taking $f$ to be the indicator of the $x$-set where $a_1(x) = k$ shows that
the asymptotic relative frequency of $k$ among the partial quotients is almost
everywhere equal to

$\frac{1}{\log 2} \int_{1/(k+1)}^{1/k} \frac{dx}{1 + x}
   = \frac{1}{\log 2} \log \frac{(k + 1)^2}{k(k + 2)}.$

In particular, the partial quotients are unbounded almost everywhere.
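
The limiting frequencies are simple to tabulate. The sketch below (the
function name is hypothetical) evaluates the formula for a few values of $k$
and confirms that the frequencies sum to 1, as they must: the partial sums
telescope to $\log_2\bigl(2(K+1)/(K+2)\bigr)$.

```python
import math

def quotient_frequency(k):
    """Almost-everywhere asymptotic relative frequency of the value k among
    the partial quotients: log2((k + 1)^2 / (k (k + 2)))."""
    return math.log2((k + 1) ** 2 / (k * (k + 2)))

for k in (1, 2, 3, 10):
    print(k, quotient_frequency(k))

K = 1000
partial_sum = math.fsum(quotient_frequency(k) for k in range(1, K + 1))
print(partial_sum, math.log2(2 * (K + 1) / (K + 2)))
```

Every positive integer has positive frequency, which is why the partial
quotients are almost everywhere unbounded; at the same time the value 1 alone
accounts for $\log_2(4/3)$ of them, a little over 41 percent.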




For understanding the accuracy of the continued-fraction algorithm, the
magnitude of $a_n(x)$ is less important than that of $q_n(x)$. The key
relationship is

(24.33)    $\frac{1}{q_n(x)(q_n(x) + q_{n+1}(x))}
             < \left|x - \frac{p_n(x)}{q_n(x)}\right|
             < \frac{1}{q_n(x) q_{n+1}(x)},$

which follows from (24.25) and (24.19). Suppose it is shown that

(24.34)    $\lim_n \frac{1}{n} \log q_n(x) = \frac{\pi^2}{12 \log 2}$

almost everywhere; (24.33) will then imply that

(24.35)    $\lim_n \frac{1}{n} \log\left|x - \frac{p_n(x)}{q_n(x)}\right|
             = -\frac{\pi^2}{6 \log 2}.$

The discrepancy between $x$ and its $n$th convergent is almost everywhere of
the order $e^{-n\pi^2/(6 \log 2)}$.

To prove (24.34), note first that since $p_{j+1}(x) = q_j(Tx)$ by (24.19), the
product $\prod_{k=1}^{n} p_{n-k+1}(T^{k-1}x)/q_{n-k+1}(T^{k-1}x)$ telescopes
to $1/q_n(x)$:

(24.36)    $\log \frac{1}{q_n(x)}
             = \sum_{k=1}^{n} \log \frac{p_{n-k+1}(T^{k-1}x)}{q_{n-k+1}(T^{k-1}x)}.$


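As observed earlier, $q_n(x) \ge 2^{(n-1)/2}$ for $n \ge 1$. Therefore, by
(24.33),

$\left|\frac{x}{p_n(x)/q_n(x)} - 1\right| \le \frac{1}{q_{n+1}(x)}
   \le \frac{1}{2^{n/2}}, \quad n \ge 1.$

Since $|\log(1 + s)| \le 4|s|$ if $|s| \le 1/\sqrt{2}$,

$\left|\log x - \log \frac{p_n(x)}{q_n(x)}\right| \le \frac{4}{2^{n/2}}.$

Therefore, by (24.36),

$\left|\sum_{k=1}^{n} \log T^{k-1}x - \log \frac{1}{q_n(x)}\right|
   \le \sum_{k=1}^{n} \left|\log T^{k-1}x
       - \log \frac{p_{n-k+1}(T^{k-1}x)}{q_{n-k+1}(T^{k-1}x)}\right|
   \le \sum_{i=1}^{\infty} \frac{4}{2^{i/2}} < \infty.$

By the ergodic theorem, then,†

$\lim_n \frac{1}{n} \log \frac{1}{q_n(x)}
   = \frac{1}{\log 2} \int_0^1 \frac{\log x}{1 + x}\,dx
   = -\frac{1}{\log 2} \int_0^1 \frac{\log(1 + x)}{x}\,dx$

$\qquad = -\frac{1}{\log 2} \sum_{k=0}^{\infty} \frac{(-1)^k}{(k + 1)^2}
        = -\frac{\pi^2}{12 \log 2}.$

Hence (24.34).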
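
The limit in (24.34) can be observed empirically. The sketch below is a
numerical experiment, not a proof: it replaces a "typical" point of the unit
interval by a random rational with a 6000-bit denominator (long enough that
its finite expansion has well over 1500 partial quotients with overwhelming
probability), computes $q_n$ exactly from (24.19), and compares
$n^{-1}\log q_n$ with $\pi^2/(12 \log 2)$. The seed, the bit length, and $n$
are arbitrary choices.

```python
import math
import random
from fractions import Fraction as Fr

def log_q_over_n(x, n):
    """(1/n) log q_n(x) for a rational x in (0, 1) whose expansion has at
    least n partial quotients; q_n comes from the recursion (24.19)."""
    q_prev, q = 0, 1
    for _ in range(n):
        inv = 1 / x
        a = inv.numerator // inv.denominator   # a_1 of the current point
        x = inv - a                            # apply T
        q_prev, q = q, a * q + q_prev
    return math.log(q) / n

random.seed(1)
bits = 6000
x = Fr(random.getrandbits(bits), 1 << bits)
estimate = log_q_over_n(x, 1500)
limit = math.pi ** 2 / (12 * math.log(2))
print(estimate, limit)
```

Convergence in (24.34) is slow, with fluctuations of order $n^{-1/2}$, so
agreement to one or two decimal places is all that should be expected here.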


Diophantine Approximation 


The fundamental theorem of the measure theory of Diophantine approximation,
due to Khinchine, is Theorem 1.5 together with Theorem 1.6. As in Section 1,
let $\varphi(q)$ be a positive function of integers and let $A_\varphi$ be the
set of $x$ in $(0, 1)$ such that

(24.37)    $\left|x - \frac{p}{q}\right| < \frac{1}{q^2 \varphi(q)}$

has infinitely many irreducible solutions $p/q$. If $\sum 1/q\varphi(q)$
converges, then $A_\varphi$ has Lebesgue measure 0, as was proved in Section 1
(Theorem 1.6). It remains to prove that if $\varphi$ is nondecreasing and
$\sum 1/q\varphi(q)$ diverges, then $A_\varphi$ has Lebesgue measure 1
(Theorem 1.5). It is enough to consider irrational $x$.


Lemma 3. For positive $\alpha_n$, the probability ($P$ or $\lambda$) of
$[x\colon a_n(x) > \alpha_n \text{ i.o.}]$ is 0 or 1 as $\sum 1/\alpha_n$
converges or diverges.


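Proof. Let $E_n = [x\colon a_n(x) > \alpha_n]$. Since
$P(E_n) = P[x\colon a_1(x) > \alpha_n]$ is of the order $1/\alpha_n$, the
first Borel-Cantelli lemma settles the convergent case (not needed in the
proof of Theorem 1.5).

By (24.31),

$\lambda(\Delta_n \cap E_{n+1})
   \ge \tfrac12 \lambda(\Delta_n) \lambda[x\colon a_1(x) > \alpha_{n+1}]
   \ge \tfrac12 \lambda(\Delta_n) \frac{1}{\alpha_{n+1} + 1}.$

Taking a union over certain of the $\Delta_n$ shows that for $m \le n$,

$\lambda(E_m^c \cap \cdots \cap E_n^c \cap E_{n+1})
   \ge \lambda(E_m^c \cap \cdots \cap E_n^c) \frac{1}{2(\alpha_{n+1} + 1)}.$

By induction on $n$,

$\lambda(E_m^c \cap \cdots \cap E_n^c)
   \le \prod_{k=m}^{n} \left(1 - \frac{1}{2(\alpha_k + 1)}\right)
   \le \exp\left[-\sum_{k=m}^{n} \frac{1}{2(\alpha_k + 1)}\right],$

as in the proof of the second Borel-Cantelli lemma. ■


†Integrate by parts over $(a, 1)$ and then let $a \downarrow 0$. For the
series, see Problem 26.28. The specific value of the limit in (24.34) is not
needed for the application that follows.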




Proof of Theorem 1.5. Fix an integer $N$ such that $\log N$ exceeds the limit
in (24.34). Then, except on a set of measure 0,

(24.38)    $q_n(x) < N^n$

holds for all but finitely many $n$. Since $\varphi$ is nondecreasing,

$\sum_{N^n \le q < N^{n+1}} \frac{1}{q\varphi(q)} \le \frac{N}{\varphi(N^n)},$

and $\sum 1/\varphi(N^n)$ diverges if $\sum 1/q\varphi(q)$ does. By the lemma,
outside a set of measure 0, $a_{n+1}(x) > \varphi(N^n)$ holds for infinitely
many $n$. If this inequality and (24.38) both hold, then by (24.33) and the
assumption that $\varphi$ is nondecreasing,

$\left|x - \frac{p_n(x)}{q_n(x)}\right| < \frac{1}{q_n(x) q_{n+1}(x)}
   \le \frac{1}{a_{n+1}(x) q_n^2(x)}
   < \frac{1}{\varphi(N^n) q_n^2(x)}
   \le \frac{1}{\varphi(q_n(x)) q_n^2(x)}.$

But $p_n(x)/q_n(x)$ is irreducible by (24.20). ■


PROBLEMS 


24.1. Fix $(\Omega, \mathcal{F})$ and a $T$ measurable
$\mathcal{F}/\mathcal{F}$. The probability measures on $(\Omega, \mathcal{F})$
preserved by $T$ form a convex set $C$. Show that $T$ is ergodic under $P$ if
and only if $P$ is an extreme point of $C$, that is, cannot be represented as
a proper convex combination of distinct elements of $C$.


24.2. Show that $T$ is ergodic if and only if
$n^{-1}\sum_{k=0}^{n-1} P(A \cap T^{-k}B) \to P(A)P(B)$ for all $A$ and $B$
(or all $A$ and $B$ in a $\pi$-system generating $\mathcal{F}$).


24.3. ↑ The transformation $T$ is mixing if

(24.39)    $P(A \cap T^{-n}B) \to P(A)P(B)$

for all $A$ and $B$.
(a) Show that mixing implies ergodicity.
(b) Show that $T$ is mixing if (24.39) holds for all $A$ and $B$ in a
$\pi$-system generating $\mathcal{F}$.
(c) Show that the Bernoulli shift is mixing.
(d) Show that a cyclic permutation is ergodic but not mixing.
(e) Show that if $c$ is not a root of unity, then the rotation (Example 24.4)
is ergodic but not mixing.


24.4. ↑ Write $T^{-n}\mathcal{F} = [T^{-n}A\colon A \in \mathcal{F}]$, and
call the $\sigma$-field
$\mathcal{F}_\infty = \bigcap_{n=1}^{\infty} T^{-n}\mathcal{F}$ trivial if
every set in it has probability either 0 or 1. (If $T$ is invertible,
$\mathcal{F}_\infty$ is $\mathcal{F}$ and hence is trivial only in
uninteresting cases.)
(a) Show that if $\mathcal{F}_\infty$ is trivial, then $T$ is ergodic. (A
cyclic permutation is ergodic even though $\mathcal{F}_\infty$ is not
trivial.)
(b) Show that if the hypotheses of Lemma 2 are satisfied, then
$\mathcal{F}_\infty$ is trivial.
(c) It can be shown by martingale theory that if $\mathcal{F}_\infty$ is
trivial, then $T$ is mixing; see Problem 35.20. Reconsider Problem 24.3(c).


24.5. 8.35 24.4↑ (a) Show that the shift corresponding to an irreducible,
aperiodic Markov chain is mixing. Do this first by Problem 8.35, then by
Problem 24.4(b), (c).
(b) Show that if the chain is irreducible but has period greater than 1, then
the shift is ergodic but not mixing.
(c) Suppose the state space splits into two closed, disjoint, nonempty
subsets, and that the initial distribution (stationary) gives positive weight
to each. Show that the corresponding shift is not ergodic.


24.6. Show that if $T$ is ergodic and if $f$ is nonnegative and
$E[f] = \infty$, then $n^{-1}\sum_{k=1}^{n} f(T^{k-1}\omega) \to \infty$ with
probability 1.


24.7. 24.3↑ Suppose that $P_0(A) = \int_A \delta\,dP$ for all $A$
($\delta \ge 0$) and that $T$ is mixing with respect to $P$ ($T$ need not
preserve $P_0$). Use (21.9) to prove

$P_0(T^{-n}A) = \int_{T^{-n}A} \delta\,dP \to P(A).$


24.8. 24.6↑ (a) Show that

$\frac{1}{n} \sum_{k=1}^{n} a_k(x) \to \infty$

and

$\sqrt[n]{a_1(x) \cdots a_n(x)}
   \to \prod_{k=1}^{\infty} \left(1 + \frac{1}{k^2 + 2k}\right)^{(\log k)/(\log 2)}$

almost everywhere.
(b) Show that

$\frac{1}{n} \log\left(q_n(x) \left|x - \frac{p_n(x)}{q_n(x)}\right|\right)
   \to -\frac{\pi^2}{12 \log 2}.$


24.9. 24.4 24.7↑ (a) Show that the continued-fraction transformation is
mixing.
(b) Show that

$\lambda[x\colon T^n x \le t] \to \frac{\log(1 + t)}{\log 2},
   \quad 0 \le t \le 1.$


CHAPTER 5 


Convergence of Distributions 


SECTION 25. WEAK CONVERGENCE 


Many of the best-known theorems in probability have to do with the asymptotic
behavior of distributions. This chapter covers both general methods for
deriving such theorems and specific applications. The present section concerns
the general limit theory for distributions on the real line, and the methods
of proof use in an essential way the order structure of the line. For the
theory in $R^k$, see Section 29.


Definitions 


Distribution functions $F_n$ were defined in Section 14 to converge weakly to
the distribution function $F$ if

(25.1)    $\lim_n F_n(x) = F(x)$

for every continuity point $x$ of $F$; this is expressed by writing
$F_n \Rightarrow F$. Examples 14.1, 14.2, and 14.3 illustrate this concept in
connection with the asymptotic distribution of maxima. Example 14.4 shows the
point of allowing (25.1) to fail if $F$ is discontinuous at $x$; see also
Example 25.4. Theorem 25.8 and Example 25.9 show why this exemption is
essential to a useful theory.

If $\mu_n$ and $\mu$ are the probability measures on $(R^1, \mathcal{R}^1)$
corresponding to $F_n$ and $F$, then $F_n \Rightarrow F$ if and only if

(25.2)    $\lim_n \mu_n(A) = \mu(A)$

for every $A$ of the form $A = (-\infty, x]$ for which $\mu\{x\} = 0$; see
(20.5). In this case the distributions themselves are said to converge weakly,
which is expressed by writing $\mu_n \Rightarrow \mu$. Thus
$F_n \Rightarrow F$ and $\mu_n \Rightarrow \mu$ are only different expressions
of the same fact. From weak convergence it follows that (25.2) holds for many
sets $A$ besides half-infinite intervals; see Theorem 25.8.




Example 25.1. Let $F_n$ be the distribution function corresponding to a unit
mass at $n$: $F_n = I_{[n, \infty)}$. Then $\lim_n F_n(x) = 0$ for every $x$,
so that (25.1) is satisfied if $F(x) \equiv 0$. But $F_n \Rightarrow F$ does
not hold, because $F$ is not a distribution function. Weak convergence is
defined in this section only for functions $F_n$ and $F$ that rise from 0 at
$-\infty$ to 1 at $+\infty$; that is, it is defined only for probability
measures $\mu_n$ and $\mu$. ■


Example 25.2. Poisson approximation to the binomial. Let $\mu_n$ be the
binomial distribution (20.6) for $p = \lambda/n$ and let $\mu$ be the Poisson
distribution (20.7). For nonnegative integers $k$,

$\mu_n\{k\} = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k
              \left(1 - \frac{\lambda}{n}\right)^{n-k}
            = \frac{\lambda^k (1 - \lambda/n)^n}{k!}
              \times \frac{1}{(1 - \lambda/n)^k}
              \prod_{i=0}^{k-1} \left(1 - \frac{i}{n}\right)$

if $n \ge k$. As $n \to \infty$ the second factor on the right goes to 1 for
fixed $k$, and so $\mu_n\{k\} \to \mu\{k\}$; this is a special case of Theorem
23.2. By the series form of Scheffé's theorem (the corollary to Theorem
16.12), (25.2) holds for every set $A$ of nonnegative integers. Since the
nonnegative integers support $\mu$ and the $\mu_n$, (25.2) even holds for
every linear Borel set $A$. Certainly $\mu_n$ converges weakly to $\mu$ in
this case. ■
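
The convergence $\mu_n(A) \to \mu(A)$ uniformly over sets $A$ of nonnegative
integers can be watched numerically through the total variation distance. The
sketch below is an illustration under stated assumptions: $\lambda = 2$, a few
values of $n$, and a truncation of the Poisson support at `kmax` (a
hypothetical parameter; the neglected mass is negligible for this $\lambda$).

```python
import math

def binomial_pmf(n, p, k):
    """mu_n{k} for the binomial distribution with parameters n and p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """mu{k} for the Poisson distribution with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def total_variation(n, lam, kmax=60):
    """sup_A |mu_n(A) - mu(A)| = (1/2) sum_k |mu_n{k} - mu{k}|, with the sum
    truncated at kmax; mu_n is binomial(n, lam/n) and mu is Poisson(lam)."""
    p = lam / n
    return 0.5 * sum(
        abs((binomial_pmf(n, p, k) if k <= n else 0.0) - poisson_pmf(lam, k))
        for k in range(kmax + 1))

for n in (10, 100, 1000):
    print(n, total_variation(n, 2.0))
```

The distances should decrease roughly in proportion to $1/n$, which is
stronger than weak convergence and reflects the fact that here (25.2) holds
for every Borel set.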


Example 25.3. Let $\mu_n$ correspond to a mass of $n^{-1}$ at each point $k/n$,
$k = 0, 1, \ldots, n - 1$; let $\mu$ be Lebesgue measure confined to the unit interval.
The corresponding distribution functions satisfy $F_n(x) = (\lfloor nx \rfloor + 1)/n \to F(x)$
for $0 \le x < 1$, and so $F_n \Rightarrow F$. In this case (25.1) holds for every $x$, but (25.2)
does not, as in the preceding example, hold for every Borel set $A$: if $A$ is the
set of rationals, then $\mu_n(A) = 1$ does not converge to $\mu(A) = 0$. Despite this,
$\mu_n$ does converge weakly to $\mu$. ∎


Example 25.4. If $\mu_n$ is a unit mass at $x_n$ and $\mu$ is a unit mass at $x$, then
$\mu_n \Rightarrow \mu$ if and only if $x_n \to x$. If $x_n > x$ for infinitely many $n$, then (25.1) fails
at the discontinuity point of $F$. ∎


Uniform Distribution Modulo 1* 


For a sequence $x_1, x_2, \ldots$ of real numbers, consider the corresponding sequence of
their fractional parts $\{x_n\} = x_n - \lfloor x_n \rfloor$. For each $n$, define a probability measure $\mu_n$ by


(25.3)    $\mu_n(A) = \frac{1}{n}\,\#[k: 1 \le k \le n,\ \{x_k\} \in A];$


There is (see Section 28) a related notion of vague convergence in which $\mu$ may be defective in
the sense that $\mu(R^1) < 1$. Weak convergence is in this context sometimes called complete
convergence.
*This topic, which requires ergodic theory, may be omitted.




$\mu_n$ has mass $n^{-1}$ at the points $\{x_1\}, \ldots, \{x_n\}$, and if several of these points coincide,
the masses add. The problem is to find the weak limit of $\{\mu_n\}$ in number-theoretically
interesting cases.

If the $\mu_n$ defined by (25.3) converge weakly to Lebesgue measure restricted to the
unit interval, the sequence $x_1, x_2, \ldots$ is said to be uniformly distributed modulo 1. In
this case every subinterval has asymptotically its proportional share of the points $\{x_n\}$;
by Theorem 25.8 below, the same is then true of every subset whose boundary has
Lebesgue measure 0.


Theorem 25.1. For $\theta$ irrational, $\theta, 2\theta, 3\theta, \ldots$ is uniformly distributed modulo 1.


PROOF. Since $\{n\theta\} = \{n\{\theta\}\}$, $\theta$ can be assumed to lie in $[0, 1]$. As in Example 24.4,
map $[0, 1)$ to the unit circle in the complex plane by $\phi(x) = e^{2\pi i x}$. If $\theta$ is irrational,
then $c = \phi(\theta)$ is not a root of unity, and so (p. 000) $T\omega = c\omega$ defines an ergodic
transformation with respect to circular Lebesgue measure $P$. Let $\mathscr{I}$ be the class of
open arcs with endpoints in some fixed countable, dense set. By the ergodic theorem,
the orbit $\{T^n\omega\}$ of almost every $\omega$ enters every $I$ in $\mathscr{I}$ with asymptotic relative
frequency $P(I)$. Fix such an $\omega$. If $I_1 \subset J \subset I_2$, where $J$ is a closed arc and $I_1, I_2$ are in
$\mathscr{I}$, then the upper and lower limits of $n^{-1}\sum_{k=1}^{n} I_J(T^{k-1}\omega)$ are between $P(I_1)$ and
$P(I_2)$, and therefore the limit exists and equals $P(J)$. Since the orbits and the arcs are
rotations of one another, every orbit enters every closed arc $J$ with frequency $P(J)$.
This is true in particular of the orbit $\{c^n\}$ of 1.

Now carry all this back to $[0, 1)$ by $\phi^{-1}$: For every $x$ in $[0, 1)$, $\{n\theta\} = \phi^{-1}(c^n)$ lies in
$[0, x]$ with asymptotic relative frequency $x$. ∎
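Theorem 25.1 can be illustrated by counting how often $\{k\theta\}$ falls in $[0, x]$. The sketch below assumes $\theta = \sqrt{2}$ and a sample size of 20000; both choices and the function names are illustrative only, and floating-point arithmetic stands in for exact fractional parts.

```python
from math import floor, sqrt

def fractional_part(x):
    """{x} = x - floor(x)."""
    return x - floor(x)

def empirical_frequency(theta, n, x):
    """Fraction of k in 1..n for which {k * theta} lies in [0, x]."""
    hits = sum(1 for k in range(1, n + 1) if fractional_part(k * theta) <= x)
    return hits / n

theta = sqrt(2.0)  # an irrational number, chosen for illustration
for x in (0.1, 0.25, 0.5, 0.9):
    print(x, empirical_frequency(theta, 20000, x))
```

Each printed frequency should be close to the corresponding $x$, in line with the asymptotic relative frequency asserted at the end of the proof.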


For a simple proof by Fourier series, see Example 26.3. 


Convergence in Distribution 


Let $X_n$ and $X$ be random variables with respective distribution functions $F_n$
and $F$. If $F_n \Rightarrow F$, then $X_n$ is said to converge in distribution or in law to $X$,
written $X_n \Rightarrow X$. This dual use of the double arrow will cause no confusion.
Because of the defining conditions (25.1) and (25.2), $X_n \Rightarrow X$ if and only if


(25.4)    $\lim_n P[X_n \le x] = P[X \le x]$


for every $x$ such that $P[X = x] = 0$.


Example 25.5. Let $X_1, X_2, \ldots$ be independent random variables, each
with the exponential distribution: $P[X_n \ge x] = e^{-\alpha x}$, $x \ge 0$. Put $M_n =
\max\{X_1, \ldots, X_n\}$ and $b_n = \alpha^{-1}\log n$. The relation (14.9), established in Example
14.1, can be restated as $P[M_n - b_n \le x] \to e^{-e^{-\alpha x}}$. If $X$ is any random
variable with distribution function $e^{-e^{-\alpha x}}$, this can be written $M_n - b_n \Rightarrow X$. ∎
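Because the maximum of independent exponential variables has an explicit distribution function, the convergence in Example 25.5 can be checked without simulation. This is a sketch under the assumption that $P[X_n \le t] = 1 - e^{-\alpha t}$ for $t \ge 0$, as in the example; the function names and the parameter values are illustrative.

```python
from math import exp, log

def cdf_centered_max(n, alpha, x):
    """Exact P[M_n - b_n <= x] for n independent exponential(alpha) variables,
    where b_n = log(n) / alpha."""
    t = x + log(n) / alpha
    if t <= 0:
        return 0.0
    return (1.0 - exp(-alpha * t)) ** n

def limit_cdf(alpha, x):
    """The limiting distribution function exp(-exp(-alpha * x))."""
    return exp(-exp(-alpha * x))

alpha = 1.5  # illustrative rate
for x in (-1.0, 0.0, 1.0, 2.0):
    print(x, cdf_centered_max(10**6, alpha, x), limit_cdf(alpha, x))
```

The two columns agree to several decimal places for large $n$, since $(1 - e^{-\alpha x}/n)^n \to e^{-e^{-\alpha x}}$.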


One is usually interested in proving weak convergence of the distributions
of some given sequence of random variables, such as the $M_n - b_n$ in this
example, and the result is often most clearly expressed in terms of the
random variables themselves rather than in terms of their distributions or
distribution functions. Although the $M_n - b_n$ here arise naturally from the
problem at hand, the random variable $X$ is simply constructed to make it
possible to express the asymptotic relation compactly by $M_n - b_n \Rightarrow X$. Recall
that by Theorem 14.1 there does exist a random variable for any prescribed
distribution.


Example 25.6. For each $n$, let $\Omega_n$ be the space of $n$-tuples of 0's and 1's,
let $\mathscr{F}_n$ consist of all subsets of $\Omega_n$, and let $P_n$ assign probability
$(\lambda/n)^k(1 - \lambda/n)^{n-k}$ to each $\omega$ consisting of $k$ 1's and $n - k$ 0's. Let $X_n(\omega)$ be the
number of 1's in $\omega$; then $X_n$, a random variable on $(\Omega_n, \mathscr{F}_n, P_n)$, represents
the number of successes in $n$ Bernoulli trials having probability $\lambda/n$ of
success at each.

Let $X$ be a random variable, on some $(\Omega, \mathscr{F}, P)$, having the Poisson
distribution with parameter $\lambda$. According to Example 25.2, $X_n \Rightarrow X$. ∎


As this example shows, the random variables $X_n$ may be defined on
entirely different probability spaces. To allow for this possibility, the $P$ on the
left in (25.4) really should be written $P_n$. Suppressing the $n$ causes no
confusion if it is understood that $P$ refers to whatever probability space it is
that $X_n$ is defined on; the underlying probability space enters into the
definition only via the distribution $\mu_n$ it induces on the line. Any instance of
$F_n \Rightarrow F$ or of $\mu_n \Rightarrow \mu$ can be rewritten in terms of convergence in distribution:
There exist random variables $X_n$ and $X$ (on some probability spaces) with
distribution functions $F_n$ and $F$, and $F_n \Rightarrow F$ and $X_n \Rightarrow X$ express the same
fact.


Convergence in Probability 


Suppose that $X, X_1, X_2, \ldots$ are random variables all defined on the
same probability space $(\Omega, \mathscr{F}, P)$. If $X_n \to X$ with probability 1, then
$P[|X_n - X| \ge \varepsilon \text{ i.o.}] = 0$ for $\varepsilon > 0$, and hence


(25.5)    $\lim_n P[|X_n - X| \ge \varepsilon] = 0$


by Theorem 4.1. Thus there is convergence in probability $X_n \to_P X$; see
Theorems 5.2 and 20.5.

Suppose that (25.5) holds for each positive $\varepsilon$. Now $P[X \le x - \varepsilon] - P[|X_n - X| \ge \varepsilon]
\le P[X_n \le x] \le P[X \le x + \varepsilon] + P[|X_n - X| \ge \varepsilon]$; letting $n$ tend to $\infty$ and
then letting $\varepsilon$ tend to 0 shows that $P[X < x] \le \liminf_n P[X_n \le x] \le
\limsup_n P[X_n \le x] \le P[X \le x]$. Thus $P[X_n \le x] \to P[X \le x]$ if $P[X = x] = 0$,
and so $X_n \Rightarrow X$:


Theorem 25.2. Suppose that $X_n$ and $X$ are random variables on the same
probability space. If $X_n \to X$ with probability 1, then $X_n \to_P X$. If $X_n \to_P X$,
then $X_n \Rightarrow X$.




Of the two implications in this theorem, neither converse holds. Because
of Example 5.4, convergence in probability does not imply convergence with
probability 1. Neither does convergence in distribution imply convergence
in probability: if $X$ and $Y$ are independent and assume the values 0 and 1
with probability $\frac{1}{2}$ each, and if $X_n = Y$, then $X_n \Rightarrow X$, but $X_n \to_P X$ cannot
hold because $P[|X - Y| = 1] = \frac{1}{2}$. What is more, (25.5) is impossible if $X$ and
the $X_n$ are defined on different probability spaces, as may happen in the case
of convergence in distribution.
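The counterexample with independent $X$ and $Y$ is small enough to enumerate exactly. The sketch below builds the four-point joint distribution explicitly; the helper `prob` and the variable names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Joint law of independent X and Y, each equal to 0 or 1 with probability 1/2.
outcomes = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

def prob(event):
    """Probability of the set of (x, y) pairs satisfying the predicate `event`."""
    return sum(p for (x, y), p in outcomes.items() if event(x, y))

# X and Y have the same distribution, so X_n = Y converges in distribution to X.
same_law = all(prob(lambda x, y, v=v: x == v) == prob(lambda x, y, v=v: y == v)
               for v in (0, 1))

# But |X_n - X| = |Y - X| equals 1 with probability 1/2 for every n.
gap = prob(lambda x, y: abs(x - y) == 1)
print(same_law, gap)
```

Since $P[|X_n - X| \ge \varepsilon] = \frac{1}{2}$ for every $n$ and every $\varepsilon \le 1$, condition (25.5) fails even though the distributions agree.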

Although (25.5) in general makes no sense unless $X$ and the $X_n$ are
defined on the same probability space, suppose that $X$ is replaced by a
constant real number $a$; that is, suppose that $X(\omega) \equiv a$. Then (25.5) becomes


(25.6)    $\lim_n P[|X_n - a| \ge \varepsilon] = 0,$


and this condition makes sense even if the space of $X_n$ does vary with $n$.
Now $a$ can be regarded as a random variable (on any probability space at all),
and it is easy to show that (25.6) implies that $X_n \Rightarrow a$: Put $\varepsilon = |x - a|$; if
$x > a$, then $P[X_n \le x] \ge P[|X_n - a| < \varepsilon] \to 1$, and if $x < a$, then $P[X_n \le x] \le
P[|X_n - a| \ge \varepsilon] \to 0$. If $a$ is regarded as a random variable, its distribution
function is 0 for $x < a$ and 1 for $x \ge a$. Thus (25.6) implies that the distribution
function of $X_n$ converges weakly to that of $a$.

Suppose, on the other hand, that $X_n \Rightarrow a$. Then $P[|X_n - a| > \varepsilon] \le P[X_n \le
a - \varepsilon] + 1 - P[X_n \le a + \varepsilon] \to 0$, so that (25.6) holds:


Theorem 25.3. The condition (25.6) holds for all positive $\varepsilon$ if and only if
$X_n \Rightarrow a$, that is, if and only if


$$\lim_n P[X_n \le x] = \begin{cases} 0 & \text{if } x < a, \\ 1 & \text{if } x > a. \end{cases}$$


If (25.6) holds for all positive $\varepsilon$, $X_n$ may be said to converge to $a$ in
probability. As this does not require that the $X_n$ be defined on the same
space, it is not really a special case of convergence in probability as defined
by (25.5). Convergence in probability in this new sense will be denoted
$X_n \Rightarrow a$, in accordance with the theorem just proved.

Example 14.4 restates the weak law of large numbers in terms of this
concept. Indeed, if $X_1, X_2, \ldots$ are independent, identically distributed random
variables with finite mean $m$, and if $S_n = X_1 + \cdots + X_n$, the weak law of
large numbers is the assertion $n^{-1}S_n \Rightarrow m$. Example 6.3 provides another
illustration: If $S_n$ is the number of cycles in a random permutation on $n$
letters, then $S_n/\log n \Rightarrow 1$.




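Example 25.7. Suppose that $X_n \Rightarrow X$ and $\delta_n \to 0$. Given $\varepsilon$ and $\eta$, choose
$x$ so that $P[|X| \ge x] < \eta$ and $P[X = \pm x] = 0$, and then choose $n_0$ so that
$n \ge n_0$ implies that $|\delta_n| < \varepsilon/x$ and $|P[X_n \le y] - P[X \le y]| < \eta$ for $y = \pm x$.
Then $P[|\delta_n X_n| \ge \varepsilon] < 3\eta$ for $n \ge n_0$. Thus $X_n \Rightarrow X$ and $\delta_n \to 0$ imply that
$\delta_n X_n \Rightarrow 0$, a restatement of Lemma 2 of Section 14 (p. 193). ∎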


The asymptotic properties of a random variable should remain unaffected
if it is altered by the addition of a random variable that goes to 0 in
probability. Let $(X_n, Y_n)$ be a two-dimensional random vector.


Theorem 25.4. If $X_n \Rightarrow X$ and $X_n - Y_n \Rightarrow 0$, then $Y_n \Rightarrow X$.


PROOF. Suppose that $y' < x < y''$ and $P[X = y'] = P[X = y''] = 0$. If $y' <
x - \varepsilon < x < x + \varepsilon < y''$, then


(25.7)    $P[X_n \le y'] - P[|X_n - Y_n| \ge \varepsilon] \le P[Y_n \le x] \le P[X_n \le y''] + P[|X_n - Y_n| \ge \varepsilon].$


Since $X_n \Rightarrow X$, letting $n \to \infty$ gives


(25.8)    $P[X \le y'] \le \liminf_n P[Y_n \le x] \le \limsup_n P[Y_n \le x] \le P[X \le y''].$


Since $P[X = y] = 0$ for all but countably many $y$, if $P[X = x] = 0$, then $y'$
and $y''$ can further be chosen so that $P[X \le y']$ and $P[X \le y'']$ are arbitrarily
near $P[X \le x]$; hence $P[Y_n \le x] \to P[X \le x]$. ∎


Theorem 25.4 has a useful extension. Suppose that $(X_n^{(u)}, Y_n)$ is a two-dimensional
random vector.


Theorem 25.5. If, for each $u$, $X_n^{(u)} \Rightarrow X^{(u)}$ as $n \to \infty$, if $X^{(u)} \Rightarrow X$ as $u \to \infty$, and if


(25.9)    $\lim_u \limsup_n P[|X_n^{(u)} - Y_n| \ge \varepsilon] = 0$


for positive $\varepsilon$, then $Y_n \Rightarrow X$.


PROOF. Replace $X_n$ by $X_n^{(u)}$ in (25.7). If $P[X = y'] = 0 = P[X^{(u)} = y']$ and
$P[X = y''] = 0 = P[X^{(u)} = y'']$, letting $n \to \infty$ and then $u \to \infty$ gives (25.8) once again. Since
$P[X = y] = 0 = P[X^{(u)} = y]$ for all but countably many $y$, the proof can be completed
as before. ∎




Fundamental Theorems 


Some of the fundamental properties of weak convergence were established in 
Section 14. It was shown there that a sequence cannot have two distinct weak 
limits: If F, = F and F, = G, then F = G. The proof is simple: The hypothe- 
sis implies that F and G agree at their common points of continuity, hence at 
all but countably many points, and hence by right continuity at all points. 
Another simple fact is this: If $\lim_n F_n(d) = F(d)$ for $d$ in a set $D$ dense in $R^1$,
then $F_n \Rightarrow F$. Indeed, if $F$ is continuous at $x$, there are in $D$ points $d'$ and $d''$
such that $d' < x < d''$ and $F(d'') - F(d') < \varepsilon$, and it follows that the limits
superior and inferior of $F_n(x)$ are within $\varepsilon$ of $F(x)$.

For any probability measure on $(R^1, \mathscr{R}^1)$ there is on some probability
space a random variable having that measure as its distribution. Therefore,
for probability measures satisfying $\mu_n \Rightarrow \mu$, there exist random variables $Y_n$
and $Y$ having these measures as distributions and satisfying $Y_n \Rightarrow Y$. According
to the following theorem, the $Y_n$ and $Y$ can be constructed on the same
probability space, and even in such a way that $Y_n(\omega) \to Y(\omega)$ for every $\omega$, a
condition much stronger than $Y_n \Rightarrow Y$. This result, Skorohod's theorem,
makes possible very simple and transparent proofs of many important facts.


Theorem 25.6. Suppose that $\mu_n$ and $\mu$ are probability measures on
$(R^1, \mathscr{R}^1)$ and $\mu_n \Rightarrow \mu$. There exist random variables $Y_n$ and $Y$ on a common
probability space $(\Omega, \mathscr{F}, P)$ such that $Y_n$ has distribution $\mu_n$, $Y$ has distribution
$\mu$, and $Y_n(\omega) \to Y(\omega)$ for each $\omega$.


PROOF. For the probability space $(\Omega, \mathscr{F}, P)$, take $\Omega = (0, 1)$, let $\mathscr{F}$ consist
of the Borel subsets of $(0, 1)$, and for $P(A)$ take the Lebesgue measure
of $A$.

The construction is related to that in the proofs of Theorems 14.1 and
20.4. Consider the distribution functions $F_n$ and $F$ corresponding to $\mu_n$ and
$\mu$. For $0 < \omega < 1$, put $Y_n(\omega) = \inf[x: \omega \le F_n(x)]$ and $Y(\omega) = \inf[x: \omega \le F(x)]$.
Since $\omega \le F_n(x)$ if and only if $Y_n(\omega) \le x$ (see the argument following (14.5)),
$P[\omega: Y_n(\omega) \le x] = P[\omega: \omega \le F_n(x)] = F_n(x)$. Thus $Y_n$ has distribution function
$F_n$; similarly, $Y$ has distribution function $F$.

It remains to show that $Y_n(\omega) \to Y(\omega)$. The idea is that $Y_n$ and $Y$ are
essentially inverse functions to $F_n$ and $F$; if the direct functions converge, so
must the inverses.

Suppose that $0 < \omega < 1$. Given $\varepsilon$, choose $x$ so that $Y(\omega) - \varepsilon < x < Y(\omega)$
and $\mu\{x\} = 0$. Then $F(x) < \omega$; $F_n(x) \to F(x)$ now implies that, for $n$ large
enough, $F_n(x) < \omega$ and hence $Y(\omega) - \varepsilon < x < Y_n(\omega)$. Thus $\liminf_n Y_n(\omega) \ge
Y(\omega)$. If $\omega < \omega'$ and $\varepsilon$ is positive, choose a $y$ for which $Y(\omega') < y < Y(\omega') + \varepsilon$
and $\mu\{y\} = 0$. Now $\omega < \omega' \le F(Y(\omega')) \le F(y)$, and so, for $n$ large enough,
$\omega \le F_n(y)$ and hence $Y_n(\omega) \le y < Y(\omega') + \varepsilon$. Thus $\limsup_n Y_n(\omega) \le Y(\omega')$ if
$\omega < \omega'$. Therefore, $Y_n(\omega) \to Y(\omega)$ if $Y$ is continuous at $\omega$.




Since $Y$ is nondecreasing on $(0, 1)$, it has at most countably many discontinuities.
At discontinuity points $\omega$ of $Y$, redefine $Y_n(\omega) = Y(\omega) = 0$. With this
change, $Y_n(\omega) \to Y(\omega)$ for every $\omega$. Since $Y$ and the $Y_n$ have been altered
only on a set of Lebesgue measure 0, their distributions are still $\mu_n$ and $\mu$. ∎
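The construction $Y_n(\omega) = \inf[x: \omega \le F_n(x)]$ can be made concrete for the measures of Example 25.3, where $\mu_n$ puts mass $1/n$ at each $k/n$ and the limit is Lebesgue measure on the unit interval. The following is a minimal sketch for finitely supported laws only; the function names and the choice $\omega = 1/3$ are illustrative assumptions, and exact rational arithmetic is used to avoid rounding issues.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

def quantile(points, masses, omega):
    """inf{x : omega <= F(x)} for a discrete law with sorted support `points`."""
    cumulative = list(accumulate(masses))            # F evaluated at each support point
    return points[bisect_left(cumulative, omega)]    # first point whose F value is >= omega

def y_n(n, omega):
    """Y_n(omega) for the law with mass 1/n at k/n, k = 0, ..., n - 1."""
    points = [Fraction(k, n) for k in range(n)]
    masses = [Fraction(1, n)] * n
    return quantile(points, masses, omega)

def y(omega):
    """Y(omega) for Lebesgue measure on the unit interval, where F(x) = x."""
    return omega

omega = Fraction(1, 3)
errors = [abs(y_n(n, omega) - y(omega)) for n in (10, 100, 1000)]
print(errors)
```

For this family $|Y_n(\omega) - Y(\omega)| \le 1/n$, so the convergence holds at every $\omega$ without any redefinition on an exceptional set.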


Note that this proof uses the order structure of the real line in an essential
way. The proof of the corresponding result in $R^k$ is more complicated.
The following mapping theorem is of very frequent use.


Theorem 25.7. Suppose that $h: R^1 \to R^1$ is measurable and that the set $D_h$
of its discontinuities is measurable.† If $\mu_n \Rightarrow \mu$ and $\mu(D_h) = 0$, then
$\mu_n h^{-1} \Rightarrow \mu h^{-1}$.


Recall (see (13.7)) that $\mu h^{-1}$ has value $\mu(h^{-1}A)$ at $A$.


PROOF. Consider the random variables $Y_n$ and $Y$ of Theorem 25.6. Since
$Y_n(\omega) \to Y(\omega)$, if $Y(\omega) \notin D_h$, then $h(Y_n(\omega)) \to h(Y(\omega))$. Since $P[\omega: Y(\omega) \in
D_h] = \mu(D_h) = 0$, it follows that $h(Y_n(\omega)) \to h(Y(\omega))$ with probability 1. Hence
$h(Y_n) \Rightarrow h(Y)$ by Theorem 25.2. Since $P[h(Y) \in A] = P[Y \in h^{-1}A] = \mu(h^{-1}A)$,
$h(Y)$ has distribution $\mu h^{-1}$; similarly, $h(Y_n)$ has distribution $\mu_n h^{-1}$. Thus
$h(Y_n) \Rightarrow h(Y)$ is the same thing as $\mu_n h^{-1} \Rightarrow \mu h^{-1}$. ∎


Because of the definition of convergence in distribution, this result has an 
equivalent statement in terms of random variables: 


Corollary 1. If $X_n \Rightarrow X$ and $P[X \in D_h] = 0$, then $h(X_n) \Rightarrow h(X)$.


Take $X \equiv a$:

Corollary 2. If $X_n \Rightarrow a$ and $h$ is continuous at $a$, then $h(X_n) \Rightarrow h(a)$.


Example 25.8. From $X_n \Rightarrow X$ it follows directly by the theorem that
$aX_n + b \Rightarrow aX + b$. Suppose also that $a_n \to a$ and $b_n \to b$. Then $(a_n - a)X_n \Rightarrow 0$
by Example 25.7, and so $(a_n X_n + b_n) - (aX_n + b) \Rightarrow 0$. And now $a_n X_n + b_n \Rightarrow
aX + b$ follows by Theorem 25.4: If $X_n \Rightarrow X$, $a_n \to a$, and $b_n \to b$, then
$a_n X_n + b_n \Rightarrow aX + b$. This fact was stated and proved differently in Section 14;
see Lemma 1 on p. 193. ∎


By definition, $\mu_n \Rightarrow \mu$ means that the corresponding distribution functions
converge weakly. The following theorem characterizes weak convergence


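†That $D_h$ lies in $\mathscr{R}^1$ is generally obvious in applications. In point of fact, it always holds (even if
$h$ is not measurable): Let $A(\varepsilon, \delta)$ be the set of $x$ for which there exist $y$ and $z$ such that
$|x - y| < \delta$, $|x - z| < \delta$, and $|h(y) - h(z)| \ge \varepsilon$. Then $A(\varepsilon, \delta)$ is open and
$D_h = \bigcup_\varepsilon \bigcap_\delta A(\varepsilon, \delta)$, where $\varepsilon$ and $\delta$ range over the positive rationals.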




without reference to distribution functions. The boundary $\partial A$ of $A$ consists
of the points that are limits of sequences in $A$ and are also limits of
sequences in $A^c$; alternatively, $\partial A$ is the closure of $A$ minus its interior. A
set $A$ is a $\mu$-continuity set if it is a Borel set and $\mu(\partial A) = 0$.


Theorem 25.8. The following three conditions are equivalent.


(i) $\mu_n \Rightarrow \mu$;
(ii) $\int f\,d\mu_n \to \int f\,d\mu$ for every bounded, continuous real function $f$;
(iii) $\mu_n(A) \to \mu(A)$ for every $\mu$-continuity set $A$.


PROOF. Suppose that $\mu_n \Rightarrow \mu$, and consider the random variables $Y_n$
and $Y$ of Theorem 25.6. Suppose that $f$ is a bounded function such
that $\mu(D_f) = 0$, where $D_f$ is the set of points of discontinuity of $f$. From
$P[Y \in D_f] = \mu(D_f) = 0$ it follows that $f(Y_n) \to f(Y)$ with probability 1, and so
by change of variable (see (21.1)) and the bounded convergence theorem,
$\int f\,d\mu_n = E[f(Y_n)] \to E[f(Y)] = \int f\,d\mu$. Thus $\mu_n \Rightarrow \mu$ and $\mu(D_f) = 0$ together
imply that $\int f\,d\mu_n \to \int f\,d\mu$ if $f$ is bounded. In particular, (i) implies (ii).
Further, if $f = I_A$, then $D_f = \partial A$, and from $\mu(\partial A) = 0$ and $\mu_n \Rightarrow \mu$ follows
$\mu_n(A) = \int f\,d\mu_n \to \int f\,d\mu = \mu(A)$. Thus (i) also implies (iii).

Since $\partial(-\infty, x] = \{x\}$, obviously (iii) implies (i). It therefore remains only
to deduce $\mu_n \Rightarrow \mu$ from (ii). Consider the corresponding distribution functions.
Suppose that $x < y$, and let $f(t)$ be 1 for $t \le x$, 0 for $t \ge y$, and
interpolate linearly on $[x, y]$: $f(t) = (y - t)/(y - x)$ for $x \le t \le y$. Since
$F_n(x) \le \int f\,d\mu_n$ and $\int f\,d\mu \le F(y)$, it follows from (ii) that $\limsup_n F_n(x) \le
F(y)$; letting $y \downarrow x$ shows that $\limsup_n F_n(x) \le F(x)$. Similarly, $F(u) \le
\liminf_n F_n(x)$ for $u < x$ and hence $F(x-) \le \liminf_n F_n(x)$. This implies
convergence at continuity points. ∎
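Condition (ii) can be seen at work for the measures of Example 25.3, where $\int f\,d\mu_n$ is a Riemann sum. The sketch below assumes the test function $f = \cos$, which is bounded and continuous; the function name and the choice of $f$ are illustrative.

```python
from math import cos, sin

def integral_against_mu_n(f, n):
    """Integral of f with respect to mu_n, which puts mass 1/n at k/n, k = 0, ..., n - 1."""
    return sum(f(k / n) for k in range(n)) / n

# Integral of cos over [0, 1] with respect to Lebesgue measure.
exact = sin(1.0)
errors = [abs(integral_against_mu_n(cos, n) - exact) for n in (10, 100, 1000)]
print(errors)
```

The errors decrease roughly like $1/n$. By contrast, the indicator of the rationals is not a bounded continuous function and the rationals are not a $\mu$-continuity set, which is why the failure of (25.2) for that set does not contradict the theorem.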


The function $f$ in this last part of the proof is uniformly continuous.
Hence $\mu_n \Rightarrow \mu$ follows if $\int f\,d\mu_n \to \int f\,d\mu$ for every bounded and uniformly
continuous $f$.


Example 25.9. The distributions in Example 25.3 satisfy $\mu_n \Rightarrow \mu$, but
$\mu_n(A)$ does not converge to $\mu(A)$ if $A$ is the set of rationals. Hence this $A$
cannot be a $\mu$-continuity set; in fact, of course, $\partial A = R^1$. ∎


The concept of weak convergence would be nearly useless if (25.2) were
not allowed to fail when $\mu(\partial A) > 0$. Since $F(x) - F(x-) = \mu\{x\} =
\mu(\partial(-\infty, x])$, it is therefore natural in the original definition to allow (25.1) to
fail when $x$ is not a continuity point of $F$.




Helly’s Theorem 


One of the most frequently used results in analysis is the Helly selection 
theorem: 


Theorem 25.9. For every sequence $\{F_n\}$ of distribution functions there exists
a subsequence $\{F_{n_k}\}$ and a nondecreasing, right-continuous function $F$ such that
$\lim_k F_{n_k}(x) = F(x)$ at continuity points $x$ of $F$.


PROOF. An application of the diagonal method [A14] gives a sequence
$\{n_k\}$ of integers along which the limit $G(r) = \lim_k F_{n_k}(r)$ exists for every
rational $r$. Define $F(x) = \inf[G(r): x < r]$. Clearly $F$ is nondecreasing.

To each $x$ and $\varepsilon$ there is an $r$ for which $x < r$ and $G(r) < F(x) + \varepsilon$. If
$x \le y < r$, then $F(y) \le G(r) < F(x) + \varepsilon$. Hence $F$ is continuous from the
right.

If $F$ is continuous at $x$, choose $y < x$ so that $F(x) - \varepsilon < F(y)$; now choose
rational $r$ and $s$ so that $y < r < x < s$ and $G(s) < F(x) + \varepsilon$. From $F(x) - \varepsilon <
G(r) \le G(s) < F(x) + \varepsilon$ and $F_n(r) \le F_n(x) \le F_n(s)$ it follows that as $k$ goes to
infinity $F_{n_k}(x)$ has limits superior and inferior within $\varepsilon$ of $F(x)$. ∎


The $F$ in this theorem necessarily satisfies $0 \le F(x) \le 1$. But $F$ need not
be a distribution function: if $F_n$ has a unit jump at $n$, for example, $F(x) \equiv 0$
is the only possibility. It is important to have a condition which ensures that
for some subsequence the limit $F$ is a distribution function.

A sequence of probability measures $\mu_n$ on $(R^1, \mathscr{R}^1)$ is said to be tight if
for each $\varepsilon$ there exists a finite interval $(a, b]$ such that $\mu_n(a, b] > 1 - \varepsilon$ for all
$n$. In terms of the corresponding distribution functions $F_n$, the condition is
that for each $\varepsilon$ there exist $x$ and $y$ such that $F_n(x) < \varepsilon$ and $F_n(y) > 1 - \varepsilon$ for
all $n$. If $\mu_n$ is a unit mass at $n$, $\{\mu_n\}$ is not tight in this sense: the mass of $\mu_n$
"escapes to infinity." Tightness is a condition preventing this escape of mass.


Theorem 25.10. Tightness is a necessary and sufficient condition that for
every subsequence $\{\mu_{n_k}\}$ there exist a further subsequence $\{\mu_{n_{k(j)}}\}$ and a
probability measure $\mu$ such that $\mu_{n_{k(j)}} \Rightarrow \mu$ as $j \to \infty$.


Only the sufficiency of the condition in this theorem is used in what 
follows. 


PROOF. Sufficiency. Apply Helly's theorem to the subsequence $\{F_{n_k}\}$ of
corresponding distribution functions. There exists a further subsequence
$\{F_{n_{k(j)}}\}$ such that $\lim_j F_{n_{k(j)}}(x) = F(x)$ at continuity points of $F$, where $F$ is
nondecreasing and right-continuous. There exists by Theorem 12.4 a measure
$\mu$ on $(R^1, \mathscr{R}^1)$ such that $\mu(a, b] = F(b) - F(a)$. Given $\varepsilon$, choose $a$ and $b$ so
that $\mu_n(a, b] > 1 - \varepsilon$ for all $n$, which is possible by tightness. By decreasing $a$




and increasing $b$, one can ensure that they are continuity points of $F$. But
then $\mu(a, b] \ge 1 - \varepsilon$. Therefore, $\mu$ is a probability measure, and of course
$\mu_{n_{k(j)}} \Rightarrow \mu$.

Necessity. If $\{\mu_n\}$ is not tight, there exists a positive $\varepsilon$ such that for each
finite interval $(a, b]$, $\mu_n(a, b] \le 1 - \varepsilon$ for some $n$. Choose $n_k$ so that
$\mu_{n_k}(-k, k] \le 1 - \varepsilon$. Suppose that some subsequence $\{\mu_{n_{k(j)}}\}$ of $\{\mu_{n_k}\}$ were to
converge weakly to some probability measure $\mu$. Choose $(a, b]$ so that
$\mu\{a\} = \mu\{b\} = 0$ and $\mu(a, b] > 1 - \varepsilon$. For large enough $j$, $(a, b] \subset
(-k(j), k(j)]$, and so $1 - \varepsilon \ge \mu_{n_{k(j)}}(-k(j), k(j)] \ge \mu_{n_{k(j)}}(a, b] \to \mu(a, b]$. Thus
$\mu(a, b] \le 1 - \varepsilon$, a contradiction. ∎


Corollary. If $\{\mu_n\}$ is a tight sequence of probability measures, and if each
subsequence that converges weakly at all converges weakly to the probability
measure $\mu$, then $\mu_n \Rightarrow \mu$.


PROOF. By the theorem, each subsequence $\{\mu_{n_k}\}$ contains a further
subsequence $\{\mu_{n_{k(j)}}\}$ converging weakly ($j \to \infty$) to some limit, and that limit
must by hypothesis be $\mu$. Thus every subsequence $\{\mu_{n_k}\}$ contains a further
subsequence $\{\mu_{n_{k(j)}}\}$ converging weakly to $\mu$.

Suppose that $\mu_n \Rightarrow \mu$ is false. Then there exists some $x$ such that $\mu\{x\} = 0$
but $\mu_n(-\infty, x]$ does not converge to $\mu(-\infty, x]$. But then there exists a
positive $\varepsilon$ such that $|\mu_{n_k}(-\infty, x] - \mu(-\infty, x]| \ge \varepsilon$ for an infinite sequence $\{n_k\}$
of integers, and no subsequence of $\{\mu_{n_k}\}$ can converge weakly to $\mu$. This
contradiction shows that $\mu_n \Rightarrow \mu$. ∎


If $\mu_n$ is a unit mass at $x_n$, then $\{\mu_n\}$ is tight if and only if $\{x_n\}$ is bounded.
The theorem above and its corollary reduce in this case to standard facts
about the real line; see Example 25.4 and A10: tightness of sequences of
probability measures is analogous to boundedness of sequences of real
numbers.


Example 25.10. Let $\mu_n$ be the normal distribution with mean $m_n$ and
variance $\sigma_n^2$. If $m_n$ and $\sigma_n^2$ are bounded, then the second moment of $\mu_n$ is
bounded, and it follows by Markov's inequality (21.12) that $\{\mu_n\}$ is tight. The
conclusion of Theorem 25.10 can also be checked directly: If $\{n_{k(j)}\}$ is chosen
so that $\lim_j m_{n_{k(j)}} = m$ and $\lim_j \sigma^2_{n_{k(j)}} = \sigma^2$, then $\mu_{n_{k(j)}} \Rightarrow \mu$, where $\mu$ is normal
with mean $m$ and variance $\sigma^2$ (a unit mass at $m$ if $\sigma^2 = 0$).

If $m_n > b$, then $\mu_n(b, \infty) \ge \frac{1}{2}$; if $m_n < a$, then $\mu_n(-\infty, a] \ge \frac{1}{2}$. Hence $\{\mu_n\}$
cannot be tight if $m_n$ is unbounded. If $m_n$ is bounded, say by $K$, then
$\mu_n(-\infty, a] \ge \nu(-\infty, (a - K)\sigma_n^{-1}]$, where $\nu$ is the standard normal distribution.
If $\sigma_n$ is unbounded, then $\nu(-\infty, (a - K)\sigma_n^{-1}] \to \frac{1}{2}$ along some subsequence,
and $\{\mu_n\}$ cannot be tight. Thus a sequence of normal distributions is
tight if and only if the means and variances are bounded. ∎
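The dichotomy in Example 25.10 can be illustrated by computing the mass that normal distributions assign to one fixed interval. This is only a sketch: the two families of parameters and the interval $(-10, 10]$ are hypothetical choices made for illustration, and checking a single interval over finitely many $n$ suggests, but does not prove, tightness or its failure.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Distribution function of the normal law with the given mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

def mass_in_interval(a, b, mean, sd):
    """mu(a, b] for the normal law with the given mean and standard deviation."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)

# Hypothetical bounded family: |m_n| <= 1 and sigma_n < 2.
bounded = [((-1.0) ** n, 1.0 + n / (n + 1.0)) for n in range(1, 200)]
worst_bounded = min(mass_in_interval(-10.0, 10.0, m, s) for m, s in bounded)

# Hypothetical family with unbounded standard deviation: sigma_n = n.
unbounded = [(0.0, float(n)) for n in range(1, 200)]
worst_unbounded = min(mass_in_interval(-10.0, 10.0, m, s) for m, s in unbounded)

print(worst_bounded, worst_unbounded)
```

For the bounded family the interval captures nearly all the mass uniformly in $n$; for the second family the captured mass drops toward 0 as $\sigma_n$ grows, the escape of mass that tightness rules out.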


Integration to the Limit 
Theorem 25.11. If $X_n \Rightarrow X$, then $E[|X|] \le \liminf_n E[|X_n|]$.


PROOF. Apply Skorohod's Theorem 25.6 to the distributions of $X_n$ and
$X$: There exist on a common probability space random variables $Y_n$ and $Y$
such that $Y = \lim_n Y_n$ with probability 1, $Y_n$ has the distribution of $X_n$, and $Y$
has the distribution of $X$. By Fatou's lemma, $E[|Y|] \le \liminf_n E[|Y_n|]$. Since
$|X|$ and $|Y|$ have the same distribution, they have the same expected value
(see (21.6)), and similarly for $|X_n|$ and $|Y_n|$. ∎


The random variables $X_n$ are said to be uniformly integrable if


(25.10)    $\lim_{\alpha \to \infty} \sup_n \int_{[|X_n| \ge \alpha]} |X_n|\,dP = 0;$


see (16.21). This implies (see (16.22)) that


(25.11)    $\sup_n E[|X_n|] < \infty.$


Theorem 25.12. If $X_n \Rightarrow X$ and the $X_n$ are uniformly integrable, then $X$ is
integrable and


(25.12)    $E[X_n] \to E[X].$


PROOF. Construct random variables $Y_n$ and $Y$ as in the preceding proof.
Since $Y_n \to Y$ with probability 1 and the $Y_n$ are uniformly integrable in the
sense of (16.21), $E[X_n] = E[Y_n] \to E[Y] = E[X]$ by Theorem 16.14. ∎


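If $\sup_n E[|X_n|^{1+\varepsilon}] < \infty$ for some positive $\varepsilon$, then the $X_n$ are uniformly
integrable because


(25.13)    $\int_{[|X_n| \ge \alpha]} |X_n|\,dP \le \frac{1}{\alpha^{\varepsilon}} E[|X_n|^{1+\varepsilon}].$


Since $X_n \Rightarrow X$ implies that $X_n^r \Rightarrow X^r$ by Theorem 25.7, there is the following
consequence of the theorem.


Corollary. Let $r$ be a positive integer. If $X_n \Rightarrow X$ and $\sup_n E[|X_n|^{r+\varepsilon}] < \infty$,
where $\varepsilon > 0$, then $E[|X|^r] < \infty$ and $E[X_n^r] \to E[X^r]$.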




The $X_n$ are also uniformly integrable if there is an integrable random
variable $Z$ such that $P[|X_n| \ge t] \le P[|Z| \ge t]$ for $t > 0$, because then (21.10)
gives


$$\int_{[|X_n| \ge \alpha]} |X_n|\,dP \le \int_{[|Z| \ge \alpha]} |Z|\,dP.$$


From this the dominated convergence theorem follows again.
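To see why some condition such as (25.10) is needed in Theorem 25.12, it may help to look at a standard family that is not uniformly integrable. The family below is an illustrative assumption, not an example from the text: $X_n$ equals $n$ with probability $1/n$ and 0 otherwise, so that $X_n \Rightarrow 0$ while $E[X_n] = 1$ for every $n$. The supremum in the sketch is taken over a finite range of $n$ only.

```python
from fractions import Fraction

def tail_integral(n, alpha):
    """Integral of |X_n| over [|X_n| >= alpha], where X_n = n with probability 1/n
    and X_n = 0 otherwise (alpha is assumed positive)."""
    return n * Fraction(1, n) if n >= alpha else Fraction(0)

def mean(n):
    """E[X_n] for the same hypothetical family."""
    return n * Fraction(1, n)

def prob_nonzero(n):
    """P[X_n != 0], which tends to 0, so that X_n converges in distribution to 0."""
    return Fraction(1, n)

sup_tail = {alpha: max(tail_integral(n, alpha) for n in range(1, 10001))
            for alpha in (10, 100, 1000)}
print(sup_tail, mean(500), prob_nonzero(500))
```

The supremum of the tail integrals stays at 1 however large $\alpha$ is taken, so (25.10) fails, and correspondingly $E[X_n] = 1$ does not converge to the mean 0 of the limit.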


PROBLEMS 


25.1. (a) Show by example that distribution functions having densities can converge
weakly even if the densities do not converge. Hint: Consider $f_n(x) = 1 +
\cos 2\pi nx$ on $[0, 1]$.

(b) Let $f_n$ be $2^n$ times the indicator of the set of $x$ in the unit interval for
which $d_{n+1}(x) = \cdots = d_{2n}(x) = 0$, where $d_k(x)$ is the $k$th dyadic digit. Show
that $f_n(x) \to 0$ except on a set of Lebesgue measure 0; on this exceptional set,
redefine $f_n(x) = 0$ for all $n$, so that $f_n(x) \to 0$ everywhere. Show that the
distributions corresponding to these densities converge weakly to Lebesgue
measure confined to the unit interval.

(c) Show that distributions with densities can converge weakly to a limit that
has no density (even to a unit mass).

(d) Show that discrete distributions can converge weakly to a distribution that
has a density.

(e) Construct an example, like that of Example 25.3, in which $\mu_n(A) \to \mu(A)$
fails but in which all the measures come from continuous densities on $[0, 1]$.


25.2. 14.8↑ Give a simple proof of the Glivenko-Cantelli theorem (Theorem 20.6)
under the extra hypothesis that $F$ is continuous.


25.3. Initial digits. (a) Show that the first significant digit of a positive number $x$ is $d$
(in the scale of 10) if and only if $\{\log_{10} x\}$ lies between $\log_{10} d$ and $\log_{10}(d + 1)$,
$d = 1, \ldots, 9$, where the braces denote fractional part.

(b) For positive numbers $x_1, x_2, \ldots$, let $N_n(d)$ be the number among the first
$n$ that have initial digit $d$. Show that


(25.14)    $\lim_n \frac{1}{n} N_n(d) = \log_{10}(d + 1) - \log_{10} d, \quad d = 1, \ldots, 9,$


if the sequence $\log_{10} x_n$, $n = 1, 2, \ldots$, is uniformly distributed modulo 1. This is
true, for example, of $x_n = \theta^n$ if $\log_{10} \theta$ is irrational.

(c) Let $D_n$ be the first significant digit of a positive random variable $X_n$. Show
that


(25.15)    $\lim_n P[D_n = d] = \log_{10}(d + 1) - \log_{10} d, \quad d = 1, \ldots, 9,$

if $\{\log_{10} X_n\} \Rightarrow U$, where $U$ is uniformly distributed over the unit interval.


25.4. Show that for each probability measure $\mu$ on the line there exist probability
measures $\mu_n$ with finite support such that $\mu_n \Rightarrow \mu$. Show further that $\mu_n\{x\}$ can
be taken rational and that each point in the support can be taken rational.
Thus there exists a countable set of probability measures such that every $\mu$ is
the weak limit of some sequence from the set. The space of distribution
functions is thus separable in the Lévy metric (see Problem 14.5).


25.5. Show that (25.5) implies that $P([X \le x] \,\triangle\, [X_n \le x]) \to 0$ if $P[X = x] = 0$.


25.6. For arbitrary random variables $X_n$ there exist positive constants $a_n$ such that
$a_n X_n \Rightarrow 0$.


25.7. Generalize Example 25.8 by showing for three-dimensional random vectors
$(A_n, B_n, X_n)$ and constants $a$ and $b$, $a > 0$, that, if $A_n \Rightarrow a$, $B_n \Rightarrow b$, and
$X_n \Rightarrow X$, then $A_n X_n + B_n \Rightarrow aX + b$. Hint: First show that if $Y_n \Rightarrow Y$ and
$D_n \Rightarrow 0$, then $D_n Y_n \Rightarrow 0$.


25.8. Suppose that $X_n \Rightarrow X$ and that $h_n$ and $h$ are Borel functions. Let $E$ be the set
of $x$ for which $h_n x_n \to hx$ fails for some sequence $x_n \to x$. Suppose that
$E \in \mathscr{R}^1$ and $P[X \in E] = 0$. Show that $h_n X_n \Rightarrow hX$.


25.9. Suppose that the distributions of random variables $X_n$ and $X$ have densities
$f_n$ and $f$. Show that if $f_n(x) \to f(x)$ for $x$ outside a set of Lebesgue measure 0,
then $X_n \Rightarrow X$.


25.10. ↑ Suppose that $X_n$ assumes as values $\gamma_n + k\delta_n$, $k = 0, \pm 1, \ldots$, where $\delta_n > 0$.
Suppose that $\delta_n \to 0$ and that, if $k_n$ is an integer varying with $n$ in such a way
that $\gamma_n + k_n\delta_n \to x$, then $P[X_n = \gamma_n + k_n\delta_n]\,\delta_n^{-1} \to f(x)$, where $f$ is the density
of a random variable $X$. Show that $X_n \Rightarrow X$.


25.11. ↑ Let $S_n$ have the binomial distribution with parameters $n$ and $p$. Assume as
known that


(25.16)    $P[S_n = k_n]\,(np(1 - p))^{1/2} \to \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$


if $(k_n - np)(np(1 - p))^{-1/2} \to x$. Deduce the DeMoivre-Laplace theorem:
$(S_n - np)(np(1 - p))^{-1/2} \Rightarrow N$, where $N$ has the standard normal distribution.
This is a special case of the central limit theorem; see Section 27.


25.12. Prove weak convergence in Example 25.3 by using Theorem 25.8 and the
theory of the Riemann integral.


25.13. (a) Show that probability measures satisfy $\mu_n \Rightarrow \mu$ if $\mu_n(a, b] \to \mu(a, b]$ whenever
$\mu\{a\} = \mu\{b\} = 0$.
(b) Show that, if $\int f\,d\mu_n \to \int f\,d\mu$ for all continuous $f$ with bounded support,
then $\mu_n \Rightarrow \mu$.


25.14. ↑ Let $\mu$ be Lebesgue measure confined to the unit interval; let $\mu_n$ correspond
to a mass of $x_{n,i} - x_{n,i-1}$ at some point in $(x_{n,i-1}, x_{n,i}]$, where $0 = x_{n0}
< x_{n1} < \cdots < x_{nn} = 1$. Show by considering the distribution functions that
$\mu_n \Rightarrow \mu$ if $\max_{i \le n}(x_{n,i} - x_{n,i-1}) \to 0$. Deduce that a bounded Borel function
continuous almost everywhere on the unit interval is Riemann integrable. See
Problem 17.1.


25.15. 2.18 5.19↑ A function $f$ of positive integers has distribution function $F$ if $F$
is the weak limit of the distribution function $P_n[m: f(m) \le x]$ of $f$ under the
measure having probability $1/n$ at each of $1, \ldots, n$ (see (2.34)). In this case
$D[m: f(m) \le x] = F(x)$ (see (2.35)) for continuity points $x$ of $F$. Show that
$\varphi(m)/m$ (see (2.37)) has a distribution:

(a) Show by the mapping theorem that it suffices to prove that $f(m) =
\log(\varphi(m)/m) = \sum_p \delta_p(m)\log(1 - 1/p)$ has a distribution.

(b) Let $f_u(m) = \sum_{p \le u} \delta_p(m)\log(1 - 1/p)$, and show by (5.45) that $f_u$ has
distribution function $F_u(x) = P[\sum_{p \le u} X_p \log(1 - 1/p) \le x]$, where the $X_p$ are
independent random variables (one for each prime $p$) such that $P[X_p = 1] =
1/p$ and $P[X_p = 0] = 1 - 1/p$.

(c) Show that $\sum_p X_p \log(1 - 1/p)$ converges with probability 1. Hint: Use
Theorem 22.6.

(d) Show that $\lim_{u \to \infty} \sup_n E_n[|f - f_u|] = 0$ (see (5.46) for the notation).

(e) Conclude by Markov's inequality and Theorem 25.5 that $f$ has the
distribution of the sum in (c).


25.16. For $A \in \mathscr{R}^1$ and $T > 0$, put $\lambda_T(A) = \lambda([-T, T] \cap A)/2T$, where $\lambda$ is Lebesgue
measure. The relative measure of $A$ is


(25.17)    $\rho(A) = \lim_{T \to \infty} \lambda_T(A),$


provided that this limit exists. This is a continuous analogue of density (see
(2.35)) for sets of integers. A Borel function $f$ has a distribution under $\lambda_T$; if
this converges weakly to $F$, then


(25.18)    $\rho[x: f(x) \le u] = F(u)$


for continuity points $u$ of $F$, and $F$ is called the distribution function of $f$.
Show that all periodic functions have distributions.


25.17. Suppose that $\sup_n \int f\,d\mu_n < \infty$ for a nonnegative $f$ such that $f(x) \to \infty$ as
$x \to \pm\infty$. Show that $\{\mu_n\}$ is tight.


25.18. 23.4↑ Show that the random variables $A_t$ and $L_t$ in Problems 23.3 and 23.4
converge in distribution. Show that the moments converge.


25.19. In the applications of Theorem 9.2, only a weaker result is actually needed:
For each $K$ there exists a positive $\alpha = \alpha(K)$ such that if $E[X] = 0$, $E[X^2] = 1$,
and $E[X^4] \le K$, then $P[X \ge 0] \ge \alpha$. Prove this by using tightness and the
corollary to Theorem 25.12.


25.20. Find uniformly integrable random variables $X_n$ for which there is no integrable
$Z$ satisfying $P[|X_n| \ge t] \le P[|Z| \ge t]$ for $t > 0$.
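As a numerical illustration of Problem 25.3(b), one can tabulate the initial digits of the powers of 2, for which $\log_{10} 2$ is irrational. The sketch below is not a solution to the problem; the function names and the cutoff of 3000 powers are illustrative assumptions, and exact integer arithmetic is used to read off the leading digit.

```python
from math import log10

def leading_digit(m):
    """First significant decimal digit of a positive integer m."""
    return int(str(m)[0])

def digit_frequencies(base, n):
    """Relative frequencies of the initial digits 0..9 among base**1, ..., base**n."""
    counts = [0] * 10
    power = 1
    for _ in range(n):
        power *= base
        counts[leading_digit(power)] += 1
    return [c / n for c in counts]

freq = digit_frequencies(2, 3000)
# Limits predicted by (25.14); index 0 is unused because 0 is never an initial digit.
predicted = [0.0] + [log10(d + 1) - log10(d) for d in range(1, 10)]
for d in range(1, 10):
    print(d, round(freq[d], 4), round(predicted[d], 4))
```

The observed frequencies track $\log_{10}(d + 1) - \log_{10} d$ closely, with the digit 1 appearing about 30 percent of the time.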


SECTION 26. CHARACTERISTIC FUNCTIONS 


Definition 


The characteristic function of a probability measure $\mu$ on the line is defined
for real $t$ by


$$\varphi(t) = \int_{-\infty}^{\infty} e^{itx}\,\mu(dx)
= \int_{-\infty}^{\infty} \cos tx\,\mu(dx) + i\int_{-\infty}^{\infty} \sin tx\,\mu(dx);$$


see the end of Section 16 for integrals of complex-valued functions.† A
random variable $X$ with distribution $\mu$ has characteristic function


$$\varphi(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx}\,\mu(dx).$$


The characteristic function is thus defined as the moment generating function
but with the real argument $s$ replaced by $it$; it has the advantage that it
always exists because $e^{itx}$ is bounded. The characteristic function in nonprobabilistic
contexts is called the Fourier transform.

The characteristic function has three fundamental properties to be estab- 
lished here: 


(i) If $\mu_1$ and $\mu_2$ have respective characteristic functions $\varphi_1(t)$ and $\varphi_2(t)$,
then $\mu_1 * \mu_2$ has characteristic function $\varphi_1(t)\varphi_2(t)$. Although convolution is
essential to the study of sums of independent random variables, it is a
complicated operation, and it is often simpler to study the products of the
corresponding characteristic functions.

(ii) The characteristic function uniquely determines the distribution. This 
shows that in studying the products in (i), no information is lost. 

(iii) From the pointwise convergence of characteristic functions follows 
the weak convergence of the corresponding distributions. This makes it 
possible, for example, to investigate the asymptotic distributions of sums of 
independent random variables by means of their characteristic functions. 


Moments and Derivatives 
It is convenient first to study the relation between a characteristic function 
and the moments of the distribution it comes from. 


†From complex variable theory only De Moivre’s formula and the simplest properties of the
exponential function are needed here.




Of course, $\varphi(0) = 1$, and by (16.30), $|\varphi(t)| \le 1$ for all t. By Theorem 16.8(i),
$\varphi(t)$ is continuous in t. In fact, $|\varphi(t+h) - \varphi(t)| \le \int |e^{ihx} - 1|\,\mu(dx)$, and so it
follows by the bounded convergence theorem that $\varphi(t)$ is uniformly continuous.

In the following relations, versions of Taylor’s formula with remainder, x 
is assumed real. Integration by parts shows that 


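(26.1)    $\int_0^x (x-s)^n e^{is}\,ds = \frac{x^{n+1}}{n+1} + \frac{i}{n+1}\int_0^x (x-s)^{n+1} e^{is}\,ds$,

and it follows by induction that


(26.2)    $e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^{n+1}}{n!}\int_0^x (x-s)^n e^{is}\,ds$


for $n \ge 0$. Replace n by $n - 1$ in (26.1), solve for the integral on the right,
and substitute this for the integral in (26.2); this gives


(26.3)    $e^{ix} = \sum_{k=0}^{n} \frac{(ix)^k}{k!} + \frac{i^n}{(n-1)!}\int_0^x (x-s)^{n-1}(e^{is} - 1)\,ds$.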


Estimating the integrals in (26.2) and (26.3) (consider separately the cases
$x \ge 0$ and $x < 0$) now leads to


(26.4)    $\left| e^{ix} - \sum_{k=0}^{n} \frac{(ix)^k}{k!} \right| \le \min\left\{ \frac{|x|^{n+1}}{(n+1)!}, \frac{2|x|^n}{n!} \right\}$


for $n \ge 0$. The first term on the right gives a sharp estimate for |x| small, the
second a sharp estimate for |x| large. For n = 0, 1, 2, the inequality
specializes to


(26.4₀)   $|e^{ix} - 1| \le \min\{|x|, 2\}$,
(26.4₁)   $|e^{ix} - (1 + ix)| \le \min\{\tfrac12 x^2, 2|x|\}$,
(26.4₂)   $|e^{ix} - (1 + ix - \tfrac12 x^2)| \le \min\{\tfrac16 |x|^3, x^2\}$.
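
The Taylor bound (26.4) can be checked numerically. The sketch below, with the illustrative helper names `remainder`, `bound`, and `bound_holds`, compares the two sides on a small grid of small and large x; the tolerance only absorbs floating-point rounding.

```python
import cmath
import math

def remainder(x, n):
    """|e^{ix} - sum_{k=0}^{n} (ix)^k / k!| for real x."""
    partial = sum((1j * x) ** k / math.factorial(k) for k in range(n + 1))
    return abs(cmath.exp(1j * x) - partial)

def bound(x, n):
    """Right side of (26.4): the smaller of the two estimates."""
    small_x = abs(x) ** (n + 1) / math.factorial(n + 1)
    large_x = 2 * abs(x) ** n / math.factorial(n)
    return min(small_x, large_x)

def bound_holds(xs, ns, tol=1e-9):
    """True if the inequality holds (up to rounding) on the whole grid."""
    return all(remainder(x, n) <= bound(x, n) + tol for x in xs for n in ns)

xs = [-50.0, -3.0, -0.1, 0.0, 0.1, 1.0, 3.0, 50.0]
print(bound_holds(xs, range(4)))
```

For |x| small the first term in the minimum is the binding one; for |x| large the second takes over, which is what makes the bound usable under the integral sign in (26.5).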


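If X has a moment of order n, it follows that


(26.5)    $\left| \varphi(t) - \sum_{k=0}^{n} \frac{(it)^k}{k!} E[X^k] \right| \le E\left[ \min\left\{ \frac{|tX|^{n+1}}{(n+1)!}, \frac{2|tX|^n}{n!} \right\} \right]$.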




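For any t satisfying


(26.6)    $\lim_{n} \frac{|t|^n E[|X|^n]}{n!} = 0$,

$\varphi(t)$ must therefore have the expansion

(26.7)    $\varphi(t) = \sum_{k=0}^{\infty} \frac{(it)^k}{k!} E[X^k]$;


compare (21.22). If

$\sum_{k=0}^{\infty} \frac{|t|^k}{k!} E[|X|^k] = E[e^{|tX|}] < \infty$,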


then (see (16.31)) (26.7) must hold. Thus (26.7) holds if X has a moment 
generating function over the whole line. 


Example 26.1. Since $E[e^{|tX|}] < \infty$ if X has the standard normal distribution,
by (26.7) and (21.7) its characteristic function is


(26.8)    $\varphi(t) = \sum_{k=0}^{\infty} \frac{(it)^{2k}}{(2k)!}\, 1 \times 3 \times \cdots \times (2k-1) = \sum_{k=0}^{\infty} \frac{1}{k!}\left( -\frac{t^2}{2} \right)^k = e^{-t^2/2}$.

This and (21.25) formally coincide if s = it. ■


If the power-series expansion (26.7) holds, the moments of X can be read 
off from it: 


(26.9)    $\varphi^{(k)}(0) = i^k E[X^k]$.


This is the analogue of (21.23). It holds, however, under the weakest possible
assumption, namely that $E[|X^k|] < \infty$. Indeed,


$\frac{\varphi(t+h) - \varphi(t)}{h} - E[iXe^{itX}] = E\left[ e^{itX}\,\frac{e^{ihX} - 1 - ihX}{h} \right]$.


By (26.4₁), the integrand on the right is dominated by 2|X| and goes to 0 with
h; hence the expected value goes to 0 by the dominated convergence
theorem. Thus $\varphi'(t) = E[iXe^{itX}]$. Repeating this argument inductively gives


(26.10)   $\varphi^{(k)}(t) = E[(iX)^k e^{itX}]$




if $E[|X^k|] < \infty$. Hence (26.9) holds if $E[|X^k|] < \infty$. The proof of uniform
continuity for $\varphi(t)$ works for $\varphi^{(k)}(t)$ as well.
If $E[X^2]$ is finite, then


(26.11)   $\varphi(t) = 1 + itE[X] - \tfrac12 t^2 E[X^2] + o(t^2), \qquad t \to 0$.


Indeed, by (26.4₂), the error is at most $t^2 E[\min\{|t||X|^3, X^2\}]$, and as $t \to 0$
the integrand goes to 0 and is dominated by $X^2$. Estimates of this kind are
essential for proving limit theorems.

The more moments $\mu$ has, the more derivatives $\varphi$ has. This is one sense in
which lightness of the tails of $\mu$ is reflected by smoothness of $\varphi$. There are
results which connect the behavior of $\varphi(t)$ as $|t| \to \infty$ with smoothness
properties of $\mu$. The Riemann–Lebesgue theorem is the most important of
these:


Theorem 26.1. If $\mu$ has a density, then $\varphi(t) \to 0$ as $|t| \to \infty$.


PROOF. The problem is to prove for integrable f that $\int f(x)e^{itx}\,dx \to 0$ as
$|t| \to \infty$. There exists by Theorem 17.1 a step function $g = \sum_k \alpha_k I_{A_k}$, a finite
linear combination of indicators of intervals $A_k = (a_k, b_k]$, for which
$\int |f - g|\,dx < \epsilon$. Now $\int f(x)e^{itx}\,dx$ differs by at most $\epsilon$ from
$\int g(x)e^{itx}\,dx = \sum_k \alpha_k (e^{itb_k} - e^{ita_k})/it$, and this goes to 0 as $|t| \to \infty$. ■


Independence 
The multiplicative property (21.28) of moment generating functions extends
to characteristic functions. Suppose that $X_1$ and $X_2$ are independent random
variables with characteristic functions $\varphi_1$ and $\varphi_2$. If $Y_j = \cos tX_j$ and
$Z_j = \sin tX_j$, then $(Y_1, Z_1)$ and $(Y_2, Z_2)$ are independent; by the rules for
integrating complex-valued functions,

$\varphi_1(t)\varphi_2(t) = (E[Y_1] + iE[Z_1])(E[Y_2] + iE[Z_2])$
$\qquad = E[Y_1]E[Y_2] - E[Z_1]E[Z_2] + i(E[Y_1]E[Z_2] + E[Z_1]E[Y_2])$
$\qquad = E[Y_1Y_2 - Z_1Z_2 + i(Y_1Z_2 + Z_1Y_2)] = E[e^{it(X_1+X_2)}]$.


This extends to sums of three or more: If $X_1, \ldots, X_n$ are independent, then


(26.12)   $E[e^{it\sum_{k=1}^{n} X_k}] = \prod_{k=1}^{n} E[e^{itX_k}]$.
k=1 




If X has characteristic function $\varphi(t)$, then aX + b has characteristic
function


(26.13)   $E[e^{it(aX+b)}] = e^{itb}\varphi(at)$.


In particular, −X has characteristic function $\varphi(-t)$, which is the complex
conjugate of $\varphi(t)$.
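
For discrete distributions both (26.12) and (26.13) can be verified exactly, since the distribution of a sum of independent variables is a finite convolution. The following Python sketch does this; the helper names `convolve` and `affine` and the die-and-coin example are illustrative assumptions.

```python
import cmath
from collections import defaultdict

def char_fn(dist, t):
    """phi(t) for a discrete distribution given as {point: mass}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in dist.items())

def convolve(d1, d2):
    """Distribution of X1 + X2 for independent X1 ~ d1 and X2 ~ d2."""
    out = defaultdict(float)
    for x1, p1 in d1.items():
        for x2, p2 in d2.items():
            out[x1 + x2] += p1 * p2
    return dict(out)

def affine(dist, a, b):
    """Distribution of aX + b for X ~ dist."""
    out = defaultdict(float)
    for x, p in dist.items():
        out[a * x + b] += p
    return dict(out)

die = {float(k): 1.0 / 6.0 for k in range(1, 7)}
coin = {0.0: 0.5, 1.0: 0.5}
t = 1.3

# (26.12): the transform of the convolution is the product of transforms.
lhs = char_fn(convolve(die, coin), t)
rhs = char_fn(die, t) * char_fn(coin, t)
print(abs(lhs - rhs))

# (26.13): aX + b has characteristic function e^{itb} phi(at).
a, b = 2.0, -3.0
print(abs(char_fn(affine(die, a, b), t) - cmath.exp(1j * t * b) * char_fn(die, a * t)))
```

Both printed differences should be at the level of floating-point rounding. The double loop in `convolve` is the "complicated operation" that the product of characteristic functions replaces.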


Inversion and the Uniqueness Theorem 


A characteristic function $\varphi$ uniquely determines the measure $\mu$ it comes
from. This fundamental fact will be derived by means of an inversion formula
through which $\mu$ can in principle be recovered from $\varphi$.

Define


$S(T) = \int_0^T \frac{\sin x}{x}\,dx, \qquad T \ge 0$.


In Example 18.4 it is shown that

(26.14)   $\lim_{T\to\infty} S(T) = \frac{\pi}{2}$;


S(T) is therefore bounded. If $\operatorname{sgn}\theta$ is +1, 0, or −1 as $\theta$ is positive, 0, or
negative, then


(26.15)   $\int_0^T \frac{\sin t\theta}{t}\,dt = \operatorname{sgn}\theta \cdot S(T|\theta|), \qquad T \ge 0$.


Theorem 26.2. If the probability measure $\mu$ has characteristic function $\varphi$,
and if $\mu\{a\} = \mu\{b\} = 0$, then


(26.16)   $\mu(a,b] = \lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it}\,\varphi(t)\,dt$.


Distinct measures cannot have the same characteristic function.


Note: By (26.4₁) the integrand here converges as $t \to 0$ to $b - a$, which is
to be taken as its value for t = 0. For fixed a and b the integrand is thus
continuous in t, and by (26.4₀) it is bounded. If $\mu$ is a unit mass at 0, then
$\varphi(t) \equiv 1$ and the integral in (26.16) cannot be extended over the whole line.


PROOF. The inversion formula will imply uniqueness: It will imply that if
$\mu$ and $\nu$ have the same characteristic function, then $\mu(a,b] = \nu(a,b]$ if
$\mu\{a\} = \nu\{a\} = \mu\{b\} = \nu\{b\} = 0$; but such intervals (a, b] form a $\pi$-system
generating $\mathscr{R}^1$.


Denote by $I_T$ the quantity inside the limit in (26.16). By Fubini’s theorem


(26.17)   $I_T = \frac{1}{2\pi}\int_{-\infty}^{\infty} \left[ \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it}\,dt \right] \mu(dx)$.


This interchange is legitimate because the double integral extends over a set
of finite product measure and by (26.4₀) the integrand is bounded by $|b - a|$.
Rewrite the integrand by De Moivre’s formula. Since sin s and cos s are odd
and even, respectively, (26.15) gives


$I_T = \int_{-\infty}^{\infty} \left[ \frac{\operatorname{sgn}(x-a)}{\pi}\,S(T \cdot |x-a|) - \frac{\operatorname{sgn}(x-b)}{\pi}\,S(T \cdot |x-b|) \right] \mu(dx)$.


The integrand here is bounded and converges as $T \to \infty$ to the function


(26.18)   $\psi_{a,b}(x) = \begin{cases} 0 & \text{if } x < a, \\ \tfrac12 & \text{if } x = a, \\ 1 & \text{if } a < x < b, \\ \tfrac12 & \text{if } x = b, \\ 0 & \text{if } b < x. \end{cases}$


Thus $I_T \to \int \psi_{a,b}\,d\mu$, which implies that (26.16) holds if $\mu\{a\} = \mu\{b\} = 0$. ■
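
The inversion formula (26.16) can be tried numerically. The sketch below is an illustration under stated assumptions: it uses the standard normal characteristic function from (26.8), truncates the integral at a finite T (harmless here because that function decays very quickly), and applies a plain midpoint rule. The names `inversion_interval` and `normal_cf` are hypothetical.

```python
import cmath
import math

def inversion_interval(phi, a, b, T=10.0, steps=4000):
    """Approximate (1/2pi) * integral over [-T, T] of the integrand in (26.16)."""
    h = 2.0 * T / steps
    total = 0.0
    for j in range(steps):
        t = -T + (j + 0.5) * h          # midpoint of the j-th subinterval
        if t == 0.0:
            kernel = b - a              # limiting value of the kernel at t = 0
        else:
            kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += (kernel * phi(t)).real  # imaginary parts cancel by symmetry
    return total * h / (2.0 * math.pi)

def normal_cf(t):
    return math.exp(-t * t / 2.0)

approx = inversion_interval(normal_cf, -1.0, 1.0)
exact = math.erf(1.0 / math.sqrt(2.0))   # P[-1 < N <= 1] for standard normal N
print(approx, exact)
```

For a characteristic function that is not integrable, such as that of a unit mass, the truncation level T matters and the limit in (26.16) has to be taken seriously.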
The inversion formula contains further information. Suppose that 


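(26.19)   $\int_{-\infty}^{\infty} |\varphi(t)|\,dt < \infty$.


In this case the integral in (26.16) can be extended over $R^1$. By (26.4₀),


$\left| \frac{e^{-itb} - e^{-ita}}{it} \right| = \frac{|e^{it(b-a)} - 1|}{|t|} \le |b - a|$;


therefore, $\mu(a,b] \le (b-a)\int_{-\infty}^{\infty} |\varphi(t)|\,dt$, and there can be no point masses.
By (26.16), the corresponding distribution function satisfies


$\frac{F(x+h) - F(x)}{h} = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-itx} - e^{-it(x+h)}}{ith}\,\varphi(t)\,dt$


(whether h is positive or negative). The integrand is by (26.4₀) dominated by
$|\varphi(t)|$ and goes to $e^{-itx}\varphi(t)$ as $h \to 0$. Therefore, F has derivative


(26.20)   $f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt$.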




Since f is continuous for the same reason $\varphi$ is, it integrates to F by the
fundamental theorem of the calculus (see (17.6)). Thus (26.19) implies that $\mu$
has the continuous density (26.20). Moreover, this is the only continuous
density. In this result, as in the Riemann–Lebesgue theorem, conditions on
the size of $\varphi(t)$ for large |t| are connected with smoothness properties of $\mu$.

The inversion formula (26.20) has many applications. In the first place, it
can be used for a new derivation of (26.14). As pointed out in Example 17.3,
the existence of the limit in (26.14) is easy to prove. Denote this limit
temporarily by $\pi_0/2$, without assuming that $\pi_0 = \pi$. Then (26.16) and
(26.20) follow as before if $\pi$ is replaced by $\pi_0$. Applying the latter to the
standard normal density (see (26.8)) gives


(26.21)   $\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2} = \frac{1}{2\pi_0}\int_{-\infty}^{\infty} e^{-itx}e^{-t^2/2}\,dt$,


where the $\pi$ on the left is that of analysis and geometry; it comes ultimately
from the quadrature (18.10). An application of (26.8) with x and t
interchanged reduces the right side of (26.21) to $(\sqrt{2\pi}/2\pi_0)e^{-x^2/2}$, and therefore
$\pi_0$ does equal $\pi$.

Consider the densities in the table. The characteristic function for the
normal distribution has already been calculated. For the uniform distribution
over (0,1), the computation is of course straightforward; note that in this case
the density cannot be recovered from (26.20), because $\varphi(t)$ is not integrable;
this is reflected in the fact that the density has discontinuities at 0 and 1.


                                                                              Characteristic
   Distribution             Density                                Interval                 Function

1. Normal                   $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$      $-\infty < x < \infty$   $e^{-t^2/2}$
2. Uniform                  $1$                                    $0 < x < 1$              $\frac{e^{it} - 1}{it}$
3. Exponential              $e^{-x}$                               $0 < x < \infty$         $\frac{1}{1 - it}$
4. Double exponential       $\frac12 e^{-|x|}$                     $-\infty < x < \infty$   $\frac{1}{1 + t^2}$
   or Laplace
5. Cauchy                   $\frac{1}{\pi}\,\frac{1}{1 + x^2}$     $-\infty < x < \infty$   $e^{-|t|}$
6. Triangular               $1 - |x|$                              $-1 < x < 1$             $2\,\frac{1 - \cos t}{t^2}$
7.                          $\frac{1}{\pi}\,\frac{1 - \cos x}{x^2}$   $-\infty < x < \infty$   $(1 - |t|)I_{(-1,1)}(t)$




The characteristic function for the exponential distribution is easily
calculated; compare Example 21.3. As for the double exponential or Laplace
distribution, $e^{-|x|}e^{itx}$ integrates over $(0, \infty)$ to $(1 - it)^{-1}$ and over $(-\infty, 0)$ to
$(1 + it)^{-1}$, which gives the result. By (26.20), then,


$e^{-|x|} = \frac{1}{\pi}\int_{-\infty}^{\infty} e^{-itx}\,\frac{dt}{1 + t^2}$.


For x = 0 this gives the standard integral $\int_{-\infty}^{\infty} dt/(1 + t^2) = \pi$; see Example
17.5. Thus the Cauchy density in the table integrates to 1 and has
characteristic function $e^{-|t|}$. This distribution has no first moment, and the
characteristic function is not differentiable at the origin.
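
The density formula (26.20) can be checked on the Cauchy row of the table, whose characteristic function $e^{-|t|}$ is integrable. The Python sketch below is a rough numerical illustration: the truncation level T, the midpoint rule, and the name `density_from_cf` are assumptions made for the example, not part of the text.

```python
import cmath
import math

def density_from_cf(phi, x, T=40.0, steps=8000):
    """Approximate f(x) = (1/2pi) * integral of e^{-itx} phi(t) dt, as in (26.20).

    The integral is truncated to [-T, T]; `steps` is even so that the
    kink of e^{-|t|} at t = 0 falls on a cell boundary of the midpoint rule.
    """
    h = 2.0 * T / steps
    total = 0.0
    for j in range(steps):
        t = -T + (j + 0.5) * h
        total += (cmath.exp(-1j * t * x) * phi(t)).real
    return total * h / (2.0 * math.pi)

def cauchy_cf(t):
    return math.exp(-abs(t))

for x in (0.0, 1.0, 2.5):
    # Compare the numerical inversion with the Cauchy density 1 / (pi (1 + x^2)).
    print(x, density_from_cf(cauchy_cf, x), 1.0 / (math.pi * (1.0 + x * x)))
```

The same routine would fail to converge for the uniform distribution on (0,1), whose characteristic function is not integrable, in line with the remark preceding the table.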

A straightforward integration shows that the triangular density has the
characteristic function given in the table, and by (26.20),


$(1 - |x|)I_{(-1,1)}(x) = \frac{1}{\pi}\int_{-\infty}^{\infty} e^{-itx}\,\frac{1 - \cos t}{t^2}\,dt$.


For x = 0 this is $\int_{-\infty}^{\infty} (1 - \cos t)t^{-2}\,dt = \pi$; hence the last line of the table.
Each density and characteristic function in the table can be transformed 
by (26.13), which gives a family of distributions. 


The Continuity Theorem 


Because of (26.12), the characteristic function provides a powerful means of 
studying the distributions of sums of independent random variables. It is 
often easier to work with products of characteristic functions than with 
convolutions, and knowing the characteristic function of the sum is by 
Theorem 26.2 in principle the same thing as knowing the distribution itself.
Because of the following continuity theorem, characteristic functions can be
used to study limit distributions.


Theorem 26.3. Let $\mu_n, \mu$ be probability measures with characteristic
functions $\varphi_n, \varphi$. A necessary and sufficient condition for $\mu_n \Rightarrow \mu$ is that
$\varphi_n(t) \to \varphi(t)$ for each t.


PROOF. Necessity. For each t, $e^{itx}$ has bounded modulus and is continuous
in x. The necessity therefore follows by an application of Theorem 25.8
(to the real and imaginary parts of $e^{itx}$).




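Sufficiency. By Fubini’s theorem,

(26.22)   $\frac{1}{u}\int_{-u}^{u} (1 - \varphi_n(t))\,dt = \int_{-\infty}^{\infty} \left[ \frac{1}{u}\int_{-u}^{u} (1 - e^{itx})\,dt \right] \mu_n(dx)$
$\qquad\qquad = 2\int_{-\infty}^{\infty} \left( 1 - \frac{\sin ux}{ux} \right) \mu_n(dx)$
$\qquad\qquad \ge 2\int_{|x| \ge 2/u} \left( 1 - \frac{1}{|ux|} \right) \mu_n(dx)$
$\qquad\qquad \ge \mu_n\left[ x\colon |x| \ge \frac{2}{u} \right]$.


(Note that the first integral is real.) Since $\varphi$ is continuous at the origin and
$\varphi(0) = 1$, there is for positive $\epsilon$ a u for which $u^{-1}\int_{-u}^{u} (1 - \varphi(t))\,dt < \epsilon$. Since
$\varphi_n$ converges to $\varphi$, the bounded convergence theorem implies that there
exists an $n_0$ such that $u^{-1}\int_{-u}^{u} (1 - \varphi_n(t))\,dt < 2\epsilon$ for $n \ge n_0$. If $a = 2/u$ in
(26.22), then $\mu_n[x\colon |x| \ge a] < 2\epsilon$ for $n \ge n_0$. Increasing a if necessary will
ensure that this inequality also holds for the finitely many n preceding $n_0$.
Therefore, $\{\mu_n\}$ is tight.

By the corollary to Theorem 25.10, $\mu_n \Rightarrow \mu$ will follow if it is shown that
each subsequence $\{\mu_{n_k}\}$ that converges weakly at all converges weakly to $\mu$.
But if $\mu_{n_k} \Rightarrow \nu$ as $k \to \infty$, then by the necessity half of the theorem, already
proved, $\nu$ has characteristic function $\lim_k \varphi_{n_k}(t) = \varphi(t)$. By Theorem 26.2, $\nu$
and $\mu$ must coincide. ■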


Two corollaries, interesting in themselves, will make clearer the structure
of the proof of sufficiency given above. In each, let $\mu_n$ be probability
measures on the line with characteristic functions $\varphi_n$.


Corollary 1. Suppose that $\lim_n \varphi_n(t) = g(t)$ for each t, where the limit
function g is continuous at 0. Then there exists a $\mu$ such that $\mu_n \Rightarrow \mu$, and $\mu$
has characteristic function g.


PROOF. The point of the corollary is that g is not assumed at the outset
to be a characteristic function. But in the argument following (26.22), only
$\varphi(0) = 1$ and the continuity of $\varphi$ at 0 were used; hence $\{\mu_n\}$ is tight under the
present hypothesis. If $\mu_{n_k} \Rightarrow \nu$ as $k \to \infty$, then $\nu$ must have characteristic
function $\lim_k \varphi_{n_k}(t) = g(t)$. Thus g is, in fact, a characteristic function, and
the proof goes through as before. ■




In this proof the continuity of g was used to establish tightness. Hence if
$\{\mu_n\}$ is assumed tight in the first place, the hypothesis of continuity can be
suppressed:


Corollary 2. Suppose that $\lim_n \varphi_n(t) = g(t)$ exists for each t and that $\{\mu_n\}$
is tight. Then there exists a $\mu$ such that $\mu_n \Rightarrow \mu$, and $\mu$ has characteristic
function g.


This second corollary applies, for example, if the $\mu_n$ have a common
bounded support.


Example 26.2. If $\mu_n$ is the uniform distribution over (−n, n), its
characteristic function is $(nt)^{-1}\sin tn$ for $t \ne 0$, and hence it converges to $I_{\{0\}}(t)$. In
this case $\{\mu_n\}$ is not tight, the limit function is not continuous at 0, and $\mu_n$
does not converge weakly. ■
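
Example 26.2 is easy to see numerically. In the sketch below (the function names are illustrative), the characteristic function of the uniform distribution over (−n, n) collapses to 0 at every fixed t ≠ 0 while staying equal to 1 at t = 0, and the mass outside any fixed interval [−a, a] tends to 1, which is the failure of tightness.

```python
import math

def uniform_cf(n, t):
    """Characteristic function of the uniform distribution over (-n, n)."""
    return 1.0 if t == 0 else math.sin(n * t) / (n * t)

def mass_outside(n, a):
    """mu_n[x: |x| > a] for the uniform distribution over (-n, n)."""
    return max(0.0, 1.0 - a / n)

for n in (1, 10, 1000, 10**6):
    # Value at t = 0, value at a fixed nonzero t, and mass escaping [-100, 100].
    print(n, uniform_cf(n, 0.0), uniform_cf(n, 0.5), mass_outside(n, 100.0))
```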


Fourier Series* 


Let $\mu$ be a probability measure on $\mathscr{R}^1$ that is supported by $[0, 2\pi]$. Its Fourier
coefficients are defined by


(26.23)   $c_m = \int_0^{2\pi} e^{imx}\,\mu(dx), \qquad m = 0, \pm 1, \pm 2, \ldots$.


These coefficients, the values of the characteristic function for integer arguments,
suffice to determine $\mu$ except for the weights it may put at 0 and $2\pi$. The relation
between $\mu$ and its Fourier coefficients can be expressed formally by


(26.24)   $\mu(dx) \sim \frac{1}{2\pi}\sum_{l=-\infty}^{\infty} c_l e^{-ilx}\,dx$:


if the $\mu(dx)$ in (26.23) is replaced by the right side of (26.24), and if the sum over l is
interchanged with the integral, the result is a formal identity.
To see how to recover $\mu$ from its Fourier coefficients, consider the symmetric
partial sums $s_m(t) = (2\pi)^{-1}\sum_{l=-m}^{m} c_l e^{-ilt}$ and their Cesàro averages
$\sigma_m(t) = m^{-1}\sum_{l=0}^{m-1} s_l(t)$. From the trigonometric identity [A24]


(26.25)   $\sum_{l=0}^{m-1}\sum_{k=-l}^{l} e^{ikx} = \frac{\sin^2 \frac12 mx}{\sin^2 \frac12 x}$


it follows that


(26.26)   $\sigma_m(t) = \frac{1}{2\pi m}\int_0^{2\pi} \frac{\sin^2 \frac12 m(x-t)}{\sin^2 \frac12 (x-t)}\,\mu(dx)$.


*This topic may be omitted.




If $\mu$ is $(2\pi)^{-1}$ times Lebesgue measure confined to $[0, 2\pi]$, then $c_0 = 1$ and $c_m = 0$
for $m \ne 0$, so that $\sigma_m(t) = s_m(t) = (2\pi)^{-1}$; this gives the identity


(26.27)   $\frac{1}{2\pi m}\int_{-\pi}^{\pi} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds = 1$.


Suppose that $0 < a < b < 2\pi$, and integrate (26.26) over (a, b). Fubini’s theorem
(the integrand is nonnegative) and a change of variable lead to


(26.28)   $\int_a^b \sigma_m(t)\,dt = \int_0^{2\pi} \left[ \frac{1}{2\pi m}\int_{a-x}^{b-x} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds \right] \mu(dx)$.


The denominator in (26.27) is bounded away from 0 outside $(-\delta, \delta)$, and so as m goes
to $\infty$ with $\delta$ fixed ($0 < \delta < \pi$),


$\frac{1}{2\pi m}\int_{\delta < |s| < \pi} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds \to 0, \qquad \frac{1}{2\pi m}\int_{|s| < \delta} \frac{\sin^2 \frac12 ms}{\sin^2 \frac12 s}\,ds \to 1$.


Therefore, the expression in brackets in (26.28) goes to 0 if $0 \le x < a$ or $b < x \le 2\pi$,
and it goes to 1 if $a < x < b$; and because of (26.27), it is bounded by 1. It follows by
the bounded convergence theorem that


(26.29)   $\mu(a,b] = \lim_m \int_a^b \sigma_m(t)\,dt$


if $\mu\{a\} = \mu\{b\} = 0$ and $0 < a < b < 2\pi$.

This is the analogue of (26.16). If $\mu$ and $\nu$ have the same Fourier coefficients, it
follows from (26.29) that $\mu(A) = \nu(A)$ for $A \subset (0, 2\pi)$ and hence that $\mu\{0, 2\pi\} =
\nu\{0, 2\pi\}$. It is clear from periodicity that the coefficients (26.23) are unchanged if $\mu\{0\}$
and $\mu\{2\pi\}$ are altered but $\mu\{0\} + \mu\{2\pi\}$ is held constant.

Suppose that $\mu_n$ is supported by $[0, 2\pi]$ and has coefficients $c_m^{(n)}$, and suppose that
$\lim_n c_m^{(n)} = c_m$ for all m. Since $\{\mu_n\}$ is tight, $\mu_n \Rightarrow \mu$ will hold if $\mu_{n_k} \Rightarrow \nu$ ($k \to \infty$)
implies $\nu = \mu$. But in this case $\nu$ and $\mu$ have the same coefficients $c_m$, and hence
they are identical except perhaps in the way they split the mass $\nu\{0, 2\pi\} =
\mu\{0, 2\pi\}$ between the points 0 and $2\pi$. But this poses no problem if $\mu\{0, 2\pi\} = 0$: If
$\lim_n c_m^{(n)} = c_m$ for all m and $\mu\{0\} = \mu\{2\pi\} = 0$, then $\mu_n \Rightarrow \mu$.


Example 26.3. If $\mu$ is $(2\pi)^{-1}$ times Lebesgue measure confined to the interval
$[0, 2\pi]$, the condition is that $\lim_n c_m^{(n)} = 0$ for $m \ne 0$. Let $x_1, x_2, \ldots$ be a sequence of
reals, and let $\mu_n$ put mass $n^{-1}$ at each point $2\pi\{x_k\}$, $1 \le k \le n$, where $\{x\} = x - \lfloor x \rfloor$
denotes fractional part. This is the probability measure (25.3) rescaled to $[0, 2\pi]$. The
sequence $x_1, x_2, \ldots$ is uniformly distributed modulo 1 if and only if

$\frac{1}{n}\sum_{k=1}^{n} e^{2\pi i\{x_k\}m} = \frac{1}{n}\sum_{k=1}^{n} e^{2\pi i x_k m} \to 0$


for $m \ne 0$. This is Weyl’s criterion.

If $x_k = k\theta$, where $\theta$ is irrational, then $\exp(2\pi i\theta m) \ne 1$ for $m \ne 0$ and hence


$\frac{1}{n}\sum_{k=1}^{n} e^{2\pi ik\theta m} = \frac{1}{n}\,e^{2\pi i\theta m}\,\frac{1 - e^{2\pi in\theta m}}{1 - e^{2\pi i\theta m}} \to 0$.

Thus $\theta, 2\theta, 3\theta, \ldots$ is uniformly distributed modulo 1 if $\theta$ is irrational, which gives
another proof of Theorem 25.1. ■
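
Weyl's criterion lends itself to a quick numerical experiment. The Python sketch below computes the averaged exponential sum for $x_k = k\theta$; the choice $\theta = \sqrt{2}$ and the name `weyl_average` are illustrative. The geometric-sum formula above bounds the modulus of the average by $1/(n|\sin \pi\theta m|)$, which the code also evaluates.

```python
import cmath
import math

def weyl_average(theta, m, n):
    """(1/n) * sum_{k=1}^{n} exp(2 pi i k theta m)."""
    return sum(cmath.exp(2j * math.pi * k * theta * m) for k in range(1, n + 1)) / n

theta = math.sqrt(2.0)
for n in (10, 100, 10000):
    avg = abs(weyl_average(theta, 1, n))
    geometric_bound = 1.0 / (n * abs(math.sin(math.pi * theta)))
    print(n, avg, geometric_bound)

# For a rational theta the average does not go to 0 for suitable m:
# with theta = 1/4 and m = 4 every term equals 1.
print(abs(weyl_average(0.25, 4, 1000)))
```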


PROBLEMS 


26.1. A random variable has a lattice distribution if for some a and b, b > 0, the
lattice $[a + nb\colon n = 0, \pm 1, \ldots]$ supports the distribution of X. Let X have
characteristic function $\varphi$.

(a) Show that a necessary condition for X to have a lattice distribution is that
$|\varphi(t)| = 1$ for some $t \ne 0$.
(b) Show that the condition is sufficient as well.


(c) Suppose that $|\varphi(t)| = |\varphi(t')| = 1$ for incommensurable t and t′ ($t \ne 0$,
$t' \ne 0$, $t/t'$ irrational). Show that $P[X = c] = 1$ for some constant c.


26.2. If $\mu(-\infty, x] = \mu[-x, \infty)$ for all x (which implies that $\mu(A) = \mu(-A)$ for all
$A \in \mathscr{R}^1$), then $\mu$ is symmetric. Show that this holds if and only if the
characteristic function is real.


26.3. Consider functions $\varphi$ that are real and nonnegative and satisfy $\varphi(-t) =
\varphi(t)$ and $\varphi(0) = 1$.
(a) Suppose that $d_1, d_2, \ldots$ are positive and $\sum_{k=1}^{\infty} d_k = \infty$, that $s_1 \ge s_2 \ge \cdots \ge
0$ and $\lim_k s_k = 0$, and that $\sum_{k=1}^{\infty} s_k d_k = 1$. Let $\varphi$ be the convex polygon whose
successive sides have slopes $-s_1, -s_2, \ldots$ and lengths $d_1, d_2, \ldots$ when
projected on the horizontal axis: $\varphi$ has value $1 - \sum_{j=1}^{k} s_j d_j$ at $t_k = d_1 + \cdots + d_k$. If
$s_n = 0$, there are in effect only n sides. Let $\varphi_0(t) = (1 - |t|)I_{(-1,1)}(t)$ be the
characteristic function in the last line of the table of densities, and show that $\varphi(t)$
is a convex combination of the characteristic functions $\varphi_0(t/t_k)$ and hence is
itself a characteristic function.
(b) Pólya’s criterion. Show that $\varphi$ is a characteristic function if it is even and
continuous and, on $[0, \infty)$, nonincreasing and convex ($\varphi(0) = 1$).


26.4. ↑ Let $\varphi_1$ and $\varphi_2$ be characteristic functions, and show that the set
$A = [t\colon \varphi_1(t) = \varphi_2(t)]$ is closed, contains 0, and is symmetric about 0. Show that every
set with these three properties can be such an A. What does this say about the
uniqueness theorem?


26.5. Show by Theorem 26.1 and integration by parts that if $\mu$ has a density f with
integrable derivative f′, then $\varphi(t) = o(t^{-1})$ as $|t| \to \infty$. Extend to higher
derivatives.


26.6. Show for independent random variables uniformly distributed over (−1, +1)
that $X_1 + \cdots + X_n$ has density $\pi^{-1}\int_0^{\infty} ((\sin t)/t)^n \cos tx\,dt$ for $n \ge 2$.


26.7. 21.17↑ Uniqueness theorem for moment generating functions. Suppose that F
has a moment generating function in $(-s_0, s_0)$, $s_0 > 0$. From the fact that
$\int_{-\infty}^{\infty} e^{zx}\,dF(x)$ is analytic in the strip $-s_0 < \operatorname{Re} z < s_0$, prove that the moment
generating function determines F. Show that it is enough that the moment
generating function exist in $[0, s_0)$, $s_0 > 0$.


26.8. 21.20 26.7↑ Show that the gamma density (20.47) has characteristic
function


$\frac{1}{(1 - it/\alpha)^u} = \exp\left[ -u \log\left( 1 - \frac{it}{\alpha} \right) \right]$,


where the logarithm is the principal part. Show that $\int_0^{\infty} e^{zx}f(x; \alpha, u)\,dx$ is
analytic for $\operatorname{Re} z < \alpha$.


26.9. Use characteristic functions for a simple proof that the family of Cauchy
distributions defined by (20.45) is closed under convolution; compare the
argument in Problem 20.14(a). Do the same for the normal distribution
(compare Example 20.6) and for the Poisson and gamma distributions.


26.10. Suppose that $F_n \Rightarrow F$ and that the characteristic functions are dominated by an
integrable function. Show that F has a density that is the limit of the densities
of the $F_n$.


26.11. Show for all a and b that the right side of (26.16) is $\mu(a,b) + \tfrac12\mu\{a\} + \tfrac12\mu\{b\}$.


26.12. By the kind of argument leading to (26.16), show that


(26.30)   $\mu\{a\} = \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} e^{-ita}\varphi(t)\,dt$.


26.13. ↑ Let $x_1, x_2, \ldots$ be the points of positive $\mu$-measure. By the following steps
prove that


(26.31)   $\lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} |\varphi(t)|^2\,dt = \sum_k (\mu\{x_k\})^2$.


Let X and Y be independent and have characteristic function $\varphi$.

(a) Show by (26.30) that the left side of (26.31) is $P[X - Y = 0]$.

(b) Show (Theorem 20.3) that $P[X - Y = 0] = \int_{-\infty}^{\infty} P[X = y]\,\mu(dy) =
\sum_k (\mu\{x_k\})^2$.


26.14. ↑ Show that $\mu$ has no point masses if $\varphi^2(t)$ is integrable.


26.15. (a) Show that if $\{\mu_n\}$ is tight, then the characteristic functions $\varphi_n(t)$ are
uniformly equicontinuous (for each $\epsilon$ there is a $\delta$ such that $|s - t| < \delta$ implies
that $|\varphi_n(s) - \varphi_n(t)| < \epsilon$ for all n).

(b) Show that $\mu_n \Rightarrow \mu$ implies that $\varphi_n(t) \to \varphi(t)$ uniformly on bounded sets.


(c) Show that the convergence in part (b) need not be uniform over the entire
line.


26.16. 14.5 26.15↑ For distribution functions F and G, define $d'(F, G) =
\sup_t |\varphi(t) - \psi(t)|/(1 + |t|)$, where $\varphi$ and $\psi$ are the corresponding characteristic
functions. Show that this is a metric and equivalent to the Lévy metric.


26.17. 25.16↑ A real function f has mean value


(26.32)   $M[f(x)] = \lim_{T\to\infty} \frac{1}{2T}\int_{-T}^{T} f(x)\,dx$,


provided that f is integrable over each [−T, T] and the limit exists.


(a) Show that, if f is bounded and $e^{itf(x)}$ has a mean value for each t, then f
has a distribution in the sense of (25.18).


(b) Show that


(26.33)   $M[e^{itx}] = \begin{cases} 1 & \text{if } t = 0, \\ 0 & \text{if } t \ne 0. \end{cases}$


Of course, f(x) = x has no distribution.


26.18. Suppose that X is irrational with probability 1. Let $\mu_n$ be the distribution of
the fractional part {nX}. Use the continuity theorem and Theorem 25.1 to
show that $n^{-1}\sum_{k=1}^{n} \mu_k$ converges weakly to the uniform distribution on [0, 1).


26.19. 25.13↑ The uniqueness theorem for characteristic functions can be derived
from the Weierstrass approximation theorem. Fill in the details of the
following argument. Let $\mu$ and $\nu$ be probability measures on the line. For continuous
f with bounded support choose a so that $\mu(-a, a)$ and $\nu(-a, a)$ are nearly 1
and f vanishes outside (−a, a). Let g be periodic and agree with f in (−a, a),
and by the Weierstrass theorem uniformly approximate g(x) by a
trigonometric sum $p(x) = \sum_{k=1}^{N} a_k e^{it_k x}$. If $\mu$ and $\nu$ have the same characteristic function,
then $\int f\,d\mu \approx \int g\,d\mu \approx \int p\,d\mu = \int p\,d\nu \approx \int g\,d\nu \approx \int f\,d\nu$.


26.20. Use the continuity theorem to prove the result in Example 25.2 concerning the
convergence of the binomial distribution to the Poisson.




26.21. According to Example 25.8, if $X_n \Rightarrow X$, $a_n \to a$, and $b_n \to b$, then $a_nX_n + b_n \Rightarrow
aX + b$. Prove this by means of characteristic functions.


26.22. 26.1 26.15↑ According to Theorem 14.2, if $X_n \Rightarrow X$ and $a_nX_n + b_n \Rightarrow Y$,
where $a_n > 0$ and the distributions of X and Y are nondegenerate, then
$a_n \to a > 0$, $b_n \to b$, and aX + b and Y have the same distribution. Prove this
by characteristic functions. Let $\varphi_n, \varphi, \psi$ be the characteristic functions of
$X_n, X, Y$.
(a) Show that $|\varphi_n(a_n t)| \to |\psi(t)|$ uniformly on bounded sets and hence that $a_n$
cannot converge to 0 along a subsequence.
(b) Interchange the roles of $\varphi$ and $\psi$ and show that $a_n$ cannot converge to
infinity along a subsequence.
(c) Show that $a_n$ converges to some a > 0.


(d) Show that $e^{itb_n} \to \psi(t)/\varphi(at)$ in a neighborhood of 0 and hence that
$\int_0^t e^{isb_n}\,ds \to \int_0^t (\psi(s)/\varphi(as))\,ds$. Conclude that $b_n$ converges.


26.23. Prove a continuity theorem for moment generating functions as defined by
(22.4) for probability measures on $[0, \infty)$. For uniqueness, see Theorem 22.2;
the analogue of (26.22) is


$\frac{2}{u}\int_0^u (1 - M(s))\,ds \ge \mu\left( \frac{2}{u}, \infty \right)$.


26.24. 26.4↑ Show by example that the values $\varphi(m)$ of the characteristic function at
integer arguments may not determine the distribution if it is not supported by
$[0, 2\pi]$.


26.25. If f is integrable over $[0, 2\pi]$, define its Fourier coefficients as $\int_0^{2\pi} e^{imx}f(x)\,dx$.
Show that these coefficients uniquely determine f up to sets of measure 0.


26.26. 19.8 26.25↑ Show that the trigonometric system (19.17) is complete.

26.27. The Fourier-series analogue of the condition (26.19) is $\sum_m |c_m| < \infty$. Show that
it implies $\mu$ has density $f(x) = (2\pi)^{-1}\sum_m c_m e^{-imx}$ on $[0, 2\pi]$, where f is
continuous and $f(0) = f(2\pi)$. This is the analogue of the inversion formula
(26.20).


26.28. ↑ Show that


$(\pi - x)^2 = \frac{\pi^2}{3} + 4\sum_{m=1}^{\infty} \frac{\cos mx}{m^2}, \qquad 0 < x < 2\pi$.

Show that $\sum_{m=1}^{\infty} 1/m^2 = \pi^2/6$ and $\sum_{m=1}^{\infty} (-1)^{m+1}/m^2 = \pi^2/12$.


26.29. (a) Suppose X′ and X″ are independent random variables with values in
$[0, 2\pi]$, and let X be X′ + X″ reduced modulo $2\pi$. Show that the corresponding
Fourier coefficients satisfy $c_m = c'_m c''_m$.


(b) Show that if one or the other of X′ and X″ is uniformly distributed, so
is X.


26.30. 26.25↑ The theory of Fourier series can be carried over from $[0, 2\pi]$ to the
unit circle in the complex plane with normalized circular Lebesgue measure P.
The circular functions $e^{imx}$ become the powers $\omega^m$, and an integrable f is
determined to within sets of measure 0 by its Fourier coefficients $c_m =
\int_{\Omega} \omega^m f(\omega)\,P(d\omega)$. Suppose that A is invariant under the rotation through the
angle arg c (Example 24.4). Find a relation on the Fourier coefficients of $I_A$,
and conclude that the rotation is ergodic if c is not a root of unity. Compare
the proof on p. 316.


SECTION 27. THE CENTRAL LIMIT THEOREM 


Identically Distributed Summands 


The central limit theorem says roughly that the sum of many independent
random variables will be approximately normally distributed if each
summand has high probability of being small. Theorem 27.1, the Lindeberg–Lévy
theorem, will give an idea of the techniques and hypotheses needed for the
more general results that follow.


Throughout, N will denote a random variable with the standard normal 
distribution: 


(27.1)    $P[N \in A] = \frac{1}{\sqrt{2\pi}}\int_A e^{-x^2/2}\,dx$.


Theorem 27.1. Suppose that $\{X_n\}$ is an independent sequence of random
variables having the same distribution with mean c and finite positive variance
$\sigma^2$. If $S_n = X_1 + \cdots + X_n$, then


(27.2)    $\frac{S_n - nc}{\sigma\sqrt{n}} \Rightarrow N$.


By the argument in Example 25.7, (27.2) implies that $n^{-1}S_n \Rightarrow c$. The
central limit theorem and the strong law of large numbers thus refine the
weak law of large numbers in different directions.

Since Theorem 27.1 is a special case of Theorem 27.2, no proof is really
necessary. To understand the methods of this section, however, consider the
special case in which $X_n$ takes the values ±1 with probability 1/2 each.
Each $X_n$ then has characteristic function $\varphi(t) = \tfrac12 e^{it} + \tfrac12 e^{-it} = \cos t$. By
(26.12) and (26.13), $S_n/\sqrt{n}$ has characteristic function $\varphi^n(t/\sqrt{n})$, and so, by
the continuity theorem, the problem is to show that $\cos^n(t/\sqrt{n}) \to E[e^{itN}] =
e^{-t^2/2}$, or that $n \log \cos(t/\sqrt{n})$ (well defined for large n) goes to $-\tfrac12 t^2$. But
this follows by l’Hôpital’s rule: Let $t/\sqrt{n} = x$ go continuously to 0.
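
The convergence in this special case is easy to observe directly. The sketch below (the function name is illustrative) evaluates $\cos^n(t/\sqrt{n})$ for increasing n and compares it with the standard normal characteristic function $e^{-t^2/2}$.

```python
import math

def normalized_coin_cf(t, n):
    """Characteristic function of S_n / sqrt(n) for fair +1/-1 summands."""
    return math.cos(t / math.sqrt(n)) ** n

t = 1.5
limit = math.exp(-t * t / 2.0)
for n in (10, 100, 10000):
    value = normalized_coin_cf(t, n)
    # The gap shrinks roughly like 1/n for fixed t.
    print(n, value, abs(value - limit))
```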




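For a proof closer in spirit to those that follow, note that (26.5) for n = 2
gives $|\varphi(t) - (1 - \tfrac12 t^2)| \le |t|^3$ ($|X_n| \le 1$). Therefore,


(27.3)    $\varphi\left( \frac{t}{\sqrt{n}} \right) = 1 - \frac{t^2}{2n} + O\left( \frac{|t|^3}{n^{3/2}} \right)$.


Rather than take logarithms, use (27.5) below, which gives (n large)


(27.4)    $\left| \varphi^n\left( \frac{t}{\sqrt{n}} \right) - \left( 1 - \frac{t^2}{2n} \right)^n \right| \le n\left| \varphi\left( \frac{t}{\sqrt{n}} \right) - \left( 1 - \frac{t^2}{2n} \right) \right| = O\left( \frac{|t|^3}{\sqrt{n}} \right) \to 0$.

But of course $(1 - t^2/2n)^n \to e^{-t^2/2}$, which completes the proof for this
special case.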


Logarithms for complex arguments can be avoided by use of the following 
simple lemma. 


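Lemma 1. Let $z_1, \ldots, z_m$ and $w_1, \ldots, w_m$ be complex numbers of modulus
at most 1; then


(27.5)    $|z_1 \cdots z_m - w_1 \cdots w_m| \le \sum_{k=1}^{m} |z_k - w_k|$.

PROOF. This follows by induction from $z_1 \cdots z_m - w_1 \cdots w_m =
(z_1 - w_1)(z_2 \cdots z_m) + w_1(z_2 \cdots z_m - w_2 \cdots w_m)$. ■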

Two illustrations of Theorem 27.1: 


Example 27.1. In the classical De Moivre–Laplace theorem, $X_n$ takes the
values 1 and 0 with probabilities p and q = 1 − p, so that c = p and $\sigma^2 = pq$.
Here $S_n$ is the number of successes in n Bernoulli trials, and
$(S_n - np)/\sqrt{npq} \Rightarrow N$. ■


Example 27.2. Suppose that one wants to estimate the parameter $\alpha$ of an
exponential distribution (20.10) on the basis of an independent sample
$X_1, \ldots, X_n$. As $n \to \infty$ the sample mean $\bar{X}_n = n^{-1}\sum_{k=1}^{n} X_k$ converges in
probability to the mean $1/\alpha$ of the distribution, and hence it is natural to use
$1/\bar{X}_n$ to estimate $\alpha$ itself. How good is the estimate? The variance of the
exponential distribution being $1/\alpha^2$ (Example 21.3), $\alpha\sqrt{n}\,(\bar{X}_n - 1/\alpha) \Rightarrow N$ by
the Lindeberg–Lévy theorem. Thus $\bar{X}_n$ is approximately normally distributed
with mean $1/\alpha$ and standard deviation $1/\alpha\sqrt{n}$.

By Skorohod’s Theorem 25.6 there exist on a single probability space
random variables $\bar{Y}_n$ and Y having the respective distributions of $\bar{X}_n$ and N
and satisfying $\alpha\sqrt{n}\,(\bar{Y}_n(\omega) - 1/\alpha) \to Y(\omega)$ for each $\omega$. Now $\bar{Y}_n(\omega) \to 1/\alpha$ and
$\alpha^{-1}\sqrt{n}\,(\bar{Y}_n(\omega)^{-1} - \alpha) = \alpha\sqrt{n}\,(\alpha^{-1} - \bar{Y}_n(\omega))/\alpha\bar{Y}_n(\omega) \to -Y(\omega)$. Since −Y has
the distribution of N and $\bar{Y}_n$ has the distribution of $\bar{X}_n$, it follows that


$\frac{\sqrt{n}}{\alpha}\left( \frac{1}{\bar{X}_n} - \alpha \right) \Rightarrow N$;


thus $1/\bar{X}_n$ is approximately normally distributed with mean $\alpha$ and standard
deviation $\alpha/\sqrt{n}$. In effect, $1/\bar{X}_n$ has been studied through the local linear
approximation to the function 1/x. This is called the delta method. ■
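
The delta-method conclusion can be examined by simulation. The sketch below is only an illustration: the parameter value, sample size, number of replications, seed, and function name are all assumptions, and the agreement it shows is approximate because both the sampling error and the O(1/n) bias of $1/\bar{X}_n$ are present.

```python
import math
import random
import statistics

def simulate_inverse_mean(alpha, n, reps, seed=0):
    """Simulate reps values of 1 / (sample mean) for n exponential(alpha) draws."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(alpha) for _ in range(n)) / n
        estimates.append(1.0 / xbar)
    return estimates

alpha, n = 2.0, 200
estimates = simulate_inverse_mean(alpha, n, reps=1000)

# Delta-method prediction: mean near alpha, standard deviation near alpha / sqrt(n).
print(statistics.mean(estimates), alpha)
print(statistics.stdev(estimates), alpha / math.sqrt(n))
```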


The Lindeberg and Lyapounov Theorems 


Suppose that for each n

(27.6)    $X_{n1}, \ldots, X_{nr_n}$


are independent; the probability space for the sequence may change with n.
Such a collection is called a triangular array of random variables. Put
$S_n = X_{n1} + \cdots + X_{nr_n}$. Theorem 27.1 covers the special case in which $r_n = n$
and $X_{nk} = X_k$. Example 6.3 on the number of cycles in a random
permutation shows that the idea of triangular array is natural and useful. The central
limit theorem for triangular arrays will be applied in Example 27.3 to the
same array.

To establish the asymptotic normality of $S_n$ by means of the ideas in the
preceding proof requires expanding the characteristic function of each $X_{nk}$
to second-order terms and estimating the remainder. Suppose that the means
are 0 and the variances are finite; write


(27.7)    $E[X_{nk}] = 0, \qquad \sigma_{nk}^2 = E[X_{nk}^2], \qquad s_n^2 = \sum_{k=1}^{r_n} \sigma_{nk}^2$.


The assumption that $X_{nk}$ has mean 0 entails no loss of generality. Assume
$s_n^2 > 0$ for large n. A successful remainder estimate is possible under the
assumption of the Lindeberg condition:


(27.8)    $\lim_{n\to\infty} \sum_{k=1}^{r_n} \frac{1}{s_n^2}\int_{|X_{nk}| \ge \epsilon s_n} X_{nk}^2\,dP = 0$

for $\epsilon > 0$.

Theorem 27.2. Suppose that for each n the sequence $X_{n1}, \ldots, X_{nr_n}$
is independent and satisfies (27.7). If (27.8) holds for all positive $\epsilon$, then
$S_n/s_n \Rightarrow N$.




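This theorem contains the preceding one: Suppose that $X_{nk} = X_k$ and
$r_n = n$, where the entire sequence $\{X_k\}$ is independent and the $X_k$ all have
the same distribution with mean 0 and variance $\sigma^2$. Then (27.8) reduces to


(27.9)    $\lim_{n\to\infty} \frac{1}{\sigma^2}\int_{|X_1| \ge \epsilon\sigma\sqrt{n}} X_1^2\,dP = 0$,


which holds because $[|X_1| \ge \epsilon\sigma\sqrt{n}] \downarrow \emptyset$ as $n \uparrow \infty$.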


PROOF OF THE THEOREM. Replacing $X_{nk}$ by $X_{nk}/s_n$ shows that there is no loss of generality in assuming

(27.10)    $$s_n^2=\sum_{k=1}^{r_n}\sigma_{nk}^2=1.$$

By (26.4$_2$),

$$\bigl|e^{itx}-(1+itx-\tfrac12t^2x^2)\bigr|\le\min\{|tx|^2,|tx|^3\}.$$

Therefore, the characteristic function $\varphi_{nk}$ of $X_{nk}$ satisfies

(27.11)    $$\bigl|\varphi_{nk}(t)-(1-\tfrac12t^2\sigma_{nk}^2)\bigr|\le E\bigl[\min\{|tX_{nk}|^2,|tX_{nk}|^3\}\bigr].$$

Note that the expected value is finite.

For positive $\epsilon$ the right side of (27.11) is at most

$$\int_{|X_{nk}|<\epsilon}|tX_{nk}|^3\,dP+\int_{|X_{nk}|\ge\epsilon}|tX_{nk}|^2\,dP\le\epsilon|t|^3\sigma_{nk}^2+t^2\int_{|X_{nk}|\ge\epsilon}X_{nk}^2\,dP.$$

Since the $\sigma_{nk}^2$ add to 1 and $\epsilon$ is arbitrary, it follows by the Lindeberg condition that

(27.12)    $$\sum_{k=1}^{r_n}\bigl|\varphi_{nk}(t)-(1-\tfrac12t^2\sigma_{nk}^2)\bigr|\to0$$

for each fixed $t$. The objective now is to show that

(27.13)    $$\prod_{k=1}^{r_n}\varphi_{nk}(t)=\prod_{k=1}^{r_n}(1-\tfrac12t^2\sigma_{nk}^2)+o(1)=\prod_{k=1}^{r_n}e^{-t^2\sigma_{nk}^2/2}+o(1)=e^{-t^2/2}+o(1).$$

For $\epsilon$ positive,

$$\sigma_{nk}^2\le\epsilon^2+\int_{|X_{nk}|\ge\epsilon}X_{nk}^2\,dP,$$

and so it follows by the Lindeberg condition (recall that $s_n$ is now 1) that

(27.14)    $$\max_{1\le k\le r_n}\sigma_{nk}^2\to0.$$

For large enough $n$, the $1-\tfrac12t^2\sigma_{nk}^2$ are all between 0 and 1, and by (27.5), $\prod_{k=1}^{r_n}\varphi_{nk}(t)$ and $\prod_{k=1}^{r_n}(1-\tfrac12t^2\sigma_{nk}^2)$ differ by at most the sum in (27.12). This establishes the first of the asymptotic relations in (27.13).

Now (27.5) also implies that

$$\Bigl|\prod_{k=1}^{r_n}e^{-t^2\sigma_{nk}^2/2}-\prod_{k=1}^{r_n}(1-\tfrac12t^2\sigma_{nk}^2)\Bigr|\le\sum_{k=1}^{r_n}\bigl|e^{-t^2\sigma_{nk}^2/2}-1+\tfrac12t^2\sigma_{nk}^2\bigr|.$$

For complex $z$,

(27.15)    $$|e^z-1-z|\le|z|^2\sum_{k=2}^{\infty}\frac{|z|^{k-2}}{k!}\le|z|^2e^{|z|}.$$

Using this in the right member of the preceding inequality bounds it by $t^4e^{t^2}\sum_{k=1}^{r_n}\sigma_{nk}^4$; by (27.14) and (27.10), this sum goes to 0, from which the second equality in (27.13) follows. ■
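The last step of the proof can be checked numerically. The Python sketch below compares $\prod_k(1-\tfrac12t^2\sigma_{nk}^2)$ with $\prod_k e^{-t^2\sigma_{nk}^2/2}$ and with the bound $t^4e^{t^2}\sum_k\sigma_{nk}^4$. The row with $n$ equal variances $1/n$ is a hypothetical choice made only for illustration, as are the function names.

```python
import math


def product_gap(t: float, variances: list[float]) -> float:
    """|prod_k (1 - t^2 v_k / 2) - prod_k exp(-t^2 v_k / 2)| for one row of the array."""
    linear = 1.0
    for v in variances:
        linear *= 1.0 - 0.5 * t * t * v
    gaussian = math.exp(-0.5 * t * t * sum(variances))
    return abs(linear - gaussian)


def fourth_power_bound(t: float, variances: list[float]) -> float:
    """The bound t^4 e^{t^2} sum_k v_k^2 obtained from (27.15)."""
    return t ** 4 * math.exp(t * t) * sum(v * v for v in variances)


def equal_row(n: int) -> list[float]:
    """Hypothetical row: n equal variances adding to 1, as in (27.10)."""
    return [1.0 / n] * n


if __name__ == "__main__":
    for n in (100, 1000):
        row = equal_row(n)
        print(n, product_gap(1.5, row), fourth_power_bound(1.5, row))
```

For this row the bound is $t^4e^{t^2}/n$, so both the gap and the bound go to 0 as $n$ grows, in line with (27.14).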


It is shown in the next section (Example 28.4) that if the independent array $\{X_{nk}\}$ satisfies (27.7), and if $S_n/s_n\Rightarrow N$, then the Lindeberg condition holds, provided $\max_{k\le r_n}\sigma_{nk}^2/s_n^2\to0$. But this converse fails without the extra condition: Take $X_{nk}=X_k$ normal with mean 0 and variance $\sigma_{nk}^2=\sigma_k^2$, where $\sigma_1^2=1$ and $\sigma_n^2=ns_{n-1}^2$.

Example 27.3. Goncharov's theorem. Consider the sum $S_n=\sum_{k=1}^{n}X_{nk}$ in Example 6.3. Here $S_n$ is the number of cycles in a random permutation on $n$ letters, the $X_{nk}$ are independent, and

$$P[X_{nk}=1]=\frac{1}{n-k+1}=1-P[X_{nk}=0].$$

The mean $m_n$ is $L_n=\sum_{k=1}^{n}k^{-1}$, and the variance $s_n^2$ is $L_n+O(1)$. Lindeberg's condition for $X_{nk}-(n-k+1)^{-1}$ is easily verified because these random variables are bounded by 1.

The theorem gives $(S_n-L_n)/s_n\Rightarrow N$. Now, in fact, $L_n=\log n+O(1)$, and so (see Example 25.8) the sum can be renormalized: $(S_n-\log n)/\sqrt{\log n}\Rightarrow N$. ■
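The quantities in this example are easy to compute. The following Python sketch evaluates $L_n$, the variance $s_n^2$, and the Lindeberg sum (27.8) for the centered indicators $X_{nk}-(n-k+1)^{-1}$; the function names are assumptions made for the illustration.

```python
import math


def harmonic(n: int) -> float:
    """L_n = sum_{k=1}^{n} 1/k."""
    return sum(1.0 / k for k in range(1, n + 1))


def cycle_variance(n: int) -> float:
    """s_n^2 = sum_j (1/j)(1 - 1/j), reindexing the row by j = n - k + 1."""
    return sum((1.0 / j) * (1.0 - 1.0 / j) for j in range(1, n + 1))


def lindeberg_sum(n: int, eps: float) -> float:
    """The sum in (27.8) for the centered indicators X_nk - 1/(n-k+1)."""
    s2 = cycle_variance(n)
    s = math.sqrt(s2)
    total = 0.0
    for j in range(1, n + 1):
        p = 1.0 / j
        # The centered variable equals 1-p with probability p and -p with probability 1-p.
        for value, prob in ((1.0 - p, p), (-p, 1.0 - p)):
            if abs(value) >= eps * s:
                total += value * value * prob
    return total / s2


if __name__ == "__main__":
    for n in (10, 1000):
        print(n, harmonic(n) - math.log(n), cycle_variance(n), lindeberg_sum(n, 0.5))
```

Because the summands are bounded by 1, the Lindeberg sum vanishes as soon as $\epsilon s_n>1$, which is the sense in which the condition is "easily verified."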




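Suppose that the $|X_{nk}|^{2+\delta}$ are integrable for some positive $\delta$ and that Lyapounov's condition

(27.16)    $$\lim_{n}\sum_{k=1}^{r_n}\frac{1}{s_n^{2+\delta}}E\bigl[|X_{nk}|^{2+\delta}\bigr]=0$$

holds. Then Lindeberg's condition follows because the sum in (27.8) is bounded by

$$\sum_{k=1}^{r_n}\frac{1}{s_n^2}\int_{|X_{nk}|\ge\epsilon s_n}\frac{|X_{nk}|^{2+\delta}}{\epsilon^\delta s_n^\delta}\,dP\le\frac{1}{\epsilon^\delta}\sum_{k=1}^{r_n}\frac{1}{s_n^{2+\delta}}E\bigl[|X_{nk}|^{2+\delta}\bigr].$$

Hence Theorem 27.2 has this corollary:

Theorem 27.3. Suppose that for each $n$ the sequence $X_{n1},\dots,X_{nr_n}$ is independent and satisfies (27.7). If (27.16) holds for some positive $\delta$, then $S_n/s_n\Rightarrow N$.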


Example 27.4. Suppose that $X_1,X_2,\dots$ are independent and uniformly bounded and have mean 0. If the variance $s_n^2$ of $S_n=X_1+\cdots+X_n$ goes to $\infty$, then $S_n/s_n\Rightarrow N$: If $K$ bounds the $X_n$, then

$$\sum_{k=1}^{n}\frac{1}{s_n^3}E\bigl[|X_k|^3\bigr]\le\sum_{k=1}^{n}\frac{KE[X_k^2]}{s_n^3}=\frac{K}{s_n}\to0,$$

which is Lyapounov's condition for $\delta=1$. ■
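A small computation makes the bound $K/s_n$ concrete. The Python sketch below assumes a hypothetical sequence $X_k=\pm c_k$ with probability $\tfrac12$ each, where the scales $c_k$ lie in $[0.5,1.5]$, so that $K=1.5$; the names and the choice of scales are illustrative only.

```python
import math


def lyapounov_ratio(scales: list[float]) -> tuple[float, float]:
    """Return (sum_k E|X_k|^3 / s_n^3, s_n) when X_k = +c_k or -c_k with probability 1/2 each."""
    s_n = math.sqrt(sum(c * c for c in scales))   # E[X_k^2] = c_k^2
    third = sum(abs(c) ** 3 for c in scales)       # E|X_k|^3 = |c_k|^3
    return third / s_n ** 3, s_n


def bounded_scales(n: int) -> list[float]:
    """Hypothetical scales c_k in [0.5, 1.5], so that K = 1.5 bounds the X_k."""
    return [1.0 + 0.5 * math.sin(k) for k in range(1, n + 1)]


if __name__ == "__main__":
    K = 1.5
    for n in (100, 10_000):
        ratio, s_n = lyapounov_ratio(bounded_scales(n))
        print(n, ratio, K / s_n)   # the ratio never exceeds K / s_n
```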


Example 27.5. Elements are drawn from a population of size $n$, randomly and with replacement, until the number of distinct elements that have been sampled is $r_n$, where $1\le r_n\le n$. Let $S_n$ be the drawing on which this first happens. A coupon collector requires $S_n$ purchases to fill out a given portion of the complete set. Suppose that $r_n$ varies with $n$ in such a way that $r_n/n\to\rho$, $0<\rho<1$. What is the approximate distribution of $S_n$?

Let $Y_p$ be the trial on which success first occurs in a Bernoulli sequence with probability $p$ for success: $P[Y_p=k]=q^{k-1}p$, where $q=1-p$. Since the moment generating function is $pe^s/(1-qe^s)$, $E[Y_p]=p^{-1}$ and $\mathrm{Var}[Y_p]=qp^{-2}$. If $k-1$ distinct items have thus far entered the sample, the waiting time until the next distinct one enters is distributed as $Y_p$ with $p=(n-k+1)/n$. Therefore, $S_n$ can be represented as $\sum_{k=1}^{r_n}X_{nk}$ for independent summands $X_{nk}$ distributed as $Y_{(n-k+1)/n}$. Since $r_n\sim\rho n$, the mean and variance above give

$$m_n=E[S_n]=\sum_{k=1}^{r_n}\Bigl(1-\frac{k-1}{n}\Bigr)^{-1}\sim n\int_0^\rho\frac{dx}{1-x}$$

and

$$s_n^2=\sum_{k=1}^{r_n}\frac{k-1}{n}\Bigl(1-\frac{k-1}{n}\Bigr)^{-2}\sim n\int_0^\rho\frac{x\,dx}{(1-x)^2}.$$

Lyapounov's theorem applies for $\delta=2$, and to check (27.16) requires the inequality

(27.17)    $$E\bigl[(Y_p-p^{-1})^4\bigr]\le Kp^{-4}$$

for some $K$ independent of $p$. A calculation with the moment generating function shows that the left side is in fact $qp^{-4}(1+7q+q^2)$. It now follows that

(27.18)    $$\sum_{k=1}^{r_n}E\Bigl[\Bigl(X_{nk}-\frac{n}{n-k+1}\Bigr)^4\Bigr]\le K\sum_{k=1}^{r_n}\Bigl(1-\frac{k-1}{n}\Bigr)^{-4}\sim Kn\int_0^\rho\frac{dx}{(1-x)^4}.$$

Since (27.16) follows from this, Theorem 27.3 applies: $(S_n-m_n)/s_n\Rightarrow N$. ■
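The two asymptotic relations for $m_n$ and $s_n^2$ can be checked directly. The Python sketch below computes the exact sums and compares them with the integrals, which evaluate to $-n\log(1-\rho)$ and $n[\rho/(1-\rho)+\log(1-\rho)]$; the function names and the test values $n=10{,}000$, $\rho=\tfrac12$ are assumptions made for the illustration.

```python
import math


def coupon_moments(n: int, r: int) -> tuple[float, float]:
    """Exact mean and variance of S_n when r distinct items out of n are required."""
    mean = 0.0
    var = 0.0
    for k in range(1, r + 1):
        q = (k - 1) / n          # probability of drawing an item already seen
        p = 1.0 - q              # probability of drawing a new item
        mean += 1.0 / p          # E[Y_p] = 1/p
        var += q / (p * p)       # Var[Y_p] = q/p^2
    return mean, var


def integral_approximations(n: int, rho: float) -> tuple[float, float]:
    """n * int_0^rho dx/(1-x) and n * int_0^rho x dx/(1-x)^2 in closed form."""
    mean = -n * math.log(1.0 - rho)
    var = n * (rho / (1.0 - rho) + math.log(1.0 - rho))
    return mean, var


if __name__ == "__main__":
    n, rho = 10_000, 0.5
    print(coupon_moments(n, int(rho * n)))
    print(integral_approximations(n, rho))
```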


Dependent Variables * 


The assumption of independence in the preceding theorems can be relaxed 
in various ways. Here a central limit theorem will be proved for sequences in 
which random variables far apart from one another are nearly independent 
in a sense to be defined. 

For a sequence $X_1,X_2,\dots$ of random variables, let $\alpha_n$ be a number such that

(27.19)    $$|P(A\cap B)-P(A)P(B)|\le\alpha_n$$

for $A\in\sigma(X_1,\dots,X_k)$, $B\in\sigma(X_{k+n},X_{k+n+1},\dots)$, and $k\ge1$, $n\ge1$. Suppose that $\alpha_n\to0$, the idea being that $X_k$ and $X_{k+n}$ are then approximately independent for large $n$. In this case the sequence $\{X_n\}$ is said to be $\alpha$-mixing. If the distribution of the random vector $(X_n,X_{n+1},\dots,X_{n+j})$ does not depend on $n$, the sequence is said to be stationary.


Example 27.6. Let $\{Y_n\}$ be a Markov chain with finite state space and positive transition probabilities $p_{ij}$, and suppose that $X_n=f(Y_n)$, where $f$ is some real function on the state space. If the initial probabilities $p_i$ are the stationary ones (see Theorem 8.9), then clearly $\{X_n\}$ is stationary. Moreover, by (8.42), $|p_{ij}^{(n)}-p_j|\le\rho^n$, where $\rho<1$. By (8.11), $P[Y_1=i_1,\dots,Y_k=i_k,\,Y_{k+n}=j_0,\dots,Y_{k+n+l}=j_l]=p_{i_1}p_{i_1i_2}\cdots p_{i_{k-1}i_k}\,p_{i_kj_0}^{(n)}\,p_{j_0j_1}\cdots p_{j_{l-1}j_l}$, which differs


*This topic may be omitted.




from $P[Y_1=i_1,\dots,Y_k=i_k]\,P[Y_{k+n}=j_0,\dots,Y_{k+n+l}=j_l]$ by at most $p_{i_1}p_{i_1i_2}\cdots p_{i_{k-1}i_k}\,\rho^n\,p_{j_0j_1}\cdots p_{j_{l-1}j_l}$. It follows by addition that, if $s$ is the number of states, then for sets of the form $A=[(Y_1,\dots,Y_k)\in H]$ and $B=[(Y_{k+n},\dots,Y_{k+n+l})\in H']$, (27.19) holds with $\alpha_n=s\rho^n$. These sets (for $k$ and $n$ fixed) form fields generating $\sigma$-fields which contain $\sigma(X_1,\dots,X_k)$ and $\sigma(X_{k+n},X_{k+n+1},\dots)$, respectively. For fixed $A$ the set of $B$ satisfying (27.19) is a monotone class, and similarly if $A$ and $B$ are interchanged. It follows by the monotone class theorem (Theorem 3.4) that $\{X_n\}$ is $\alpha$-mixing with $\alpha_n=s\rho^n$. ■


The sequence is $m$-dependent if $(X_1,\dots,X_k)$ and $(X_{k+n},\dots,X_{k+n+l})$ are independent whenever $n>m$. In this case the sequence is $\alpha$-mixing with $\alpha_n=0$ for $n>m$. In this terminology an independent sequence is 0-dependent.


Example 27.7. Let $Y_1,Y_2,\dots$ be independent and identically distributed, and put $X_n=f(Y_n,\dots,Y_{n+m})$ for a real function $f$ on $R^{m+1}$. Then $\{X_n\}$ is stationary and $m$-dependent. ■


Theorem 27.4. Suppose that $X_1,X_2,\dots$ is stationary and $\alpha$-mixing with $\alpha_n=O(n^{-5})$ and that $E[X_n]=0$ and $E[X_n^{12}]<\infty$. If $S_n=X_1+\cdots+X_n$, then

(27.20)    $$n^{-1}\mathrm{Var}[S_n]\to\sigma^2=E[X_1^2]+2\sum_{k=1}^{\infty}E[X_1X_{1+k}],$$

where the series converges absolutely. If $\sigma>0$, then $S_n/\sigma\sqrt n\Rightarrow N$.


The conditions $\alpha_n=O(n^{-5})$ and $E[X_n^{12}]<\infty$ are stronger than necessary; they are imposed to avoid technical complications in the proof. The idea of the proof, which goes back to Markov, is this: Split the sum $X_1+\cdots+X_n$ into alternate blocks of length $b_n$ (the big blocks) and $l_n$ (the little blocks). Namely, let

(27.21)    $$U_{ni}=X_{(i-1)(b_n+l_n)+1}+\cdots+X_{(i-1)(b_n+l_n)+b_n},\qquad1\le i\le r_n,$$

where $r_n$ is the largest integer $i$ for which $(i-1)(b_n+l_n)+b_n<n$. Further, let

(27.22)    $$\begin{aligned}V_{ni}&=X_{(i-1)(b_n+l_n)+b_n+1}+\cdots+X_{i(b_n+l_n)},\qquad1\le i<r_n,\\V_{nr_n}&=X_{(r_n-1)(b_n+l_n)+b_n+1}+\cdots+X_n.\end{aligned}$$


Then $S_n=\sum_{i=1}^{r_n}U_{ni}+\sum_{i=1}^{r_n}V_{ni}$, and the technique will be to choose the $l_n$ small enough that $\sum_iV_{ni}$ is small in comparison with $\sum_iU_{ni}$ but large enough that the $U_{ni}$ are nearly independent, so that Lyapounov's theorem can be adapted to prove $\sum_iU_{ni}$ asymptotically normal.
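The block decomposition is easy to make explicit. The Python sketch below lists the index ranges of the big and little blocks in (27.21) and (27.22), using the block lengths $b_n=\lfloor n^{3/4}\rfloor$ and $l_n=\lfloor n^{1/4}\rfloor$ that are chosen later in the proof; the function name and the return format are assumptions made for the illustration.

```python
from math import isqrt


def block_layout(n: int):
    """Index ranges (first, last), 1-based and inclusive, of the blocks for n >= 2."""
    b = isqrt(isqrt(n ** 3))   # floor(n^(3/4)), computed in exact integer arithmetic
    l = isqrt(isqrt(n))        # floor(n^(1/4))
    # r is the largest i with (i-1)(b+l) + b < n.
    r = 1
    while r * (b + l) + b < n:
        r += 1
    big = [((i - 1) * (b + l) + 1, (i - 1) * (b + l) + b) for i in range(1, r + 1)]
    little = [((i - 1) * (b + l) + b + 1, i * (b + l)) for i in range(1, r)]
    # The last little block absorbs everything up to n.
    little.append(((r - 1) * (b + l) + b + 1, n))
    return b, l, r, big, little


if __name__ == "__main__":
    b, l, r, big, little = block_layout(10_000)
    print(b, l, r)
    print(big[0], little[0], big[-1], little[-1])
```

For $n=10{,}000$ this gives $b_n=1000$, $l_n=10$, and $r_n=9$; the final little block is longer than the others but has at most $b_n+l_n$ terms.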


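Lemma 2. If $Y$ is measurable $\sigma(X_1,\dots,X_k)$ and bounded by $C$, and if $Z$ is measurable $\sigma(X_{k+n},X_{k+n+1},\dots)$ and bounded by $D$, then

(27.23)    $$|E[YZ]-E[Y]E[Z]|\le4CD\alpha_n.$$

PROOF. It is no restriction to take $C=D=1$ and (by the usual approximation method) to take $Y=\sum_iy_iI_{A_i}$ and $Z=\sum_jz_jI_{B_j}$ simple ($|y_i|,|z_j|\le1$). If $d_{ij}=P(A_i\cap B_j)-P(A_i)P(B_j)$, the left side of (27.23) is $|\sum_{ij}y_iz_jd_{ij}|$. Take $\xi_i$ to be $+1$ or $-1$ as $\sum_jz_jd_{ij}$ is positive or not; now take $\eta_j$ to be $+1$ or $-1$ as $\sum_i\xi_id_{ij}$ is positive or not. Then

$$\Bigl|\sum_{ij}y_iz_jd_{ij}\Bigr|\le\sum_i\Bigl|\sum_jz_jd_{ij}\Bigr|=\sum_i\xi_i\sum_jz_jd_{ij}\le\sum_j\Bigl|\sum_i\xi_id_{ij}\Bigr|=\sum_{ij}\xi_i\eta_jd_{ij}.$$

Let $A^{(0)}$ [$B^{(0)}$] be the union of the $A_i$ [$B_j$] for which $\xi_i=+1$ [$\eta_j=+1$], and let $A^{(1)}=\Omega-A^{(0)}$ [$B^{(1)}=\Omega-B^{(0)}$]. Then

$$\sum_{ij}\xi_i\eta_jd_{ij}\le\sum_{u,v}\bigl|P(A^{(u)}\cap B^{(v)})-P(A^{(u)})P(B^{(v)})\bigr|\le4\alpha_n.$$ ■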


Lemma 3. If $Y$ is measurable $\sigma(X_1,\dots,X_k)$ and $E[Y^4]\le C$, and if $Z$ is measurable $\sigma(X_{k+n},X_{k+n+1},\dots)$ and $E[Z^4]\le D$, then

(27.24)    $$|E[YZ]-E[Y]E[Z]|\le8(1+C+D)\alpha_n^{1/2}.$$

PROOF. Let $Y_0=YI_{[|Y|\le a]}$, $Y_1=YI_{[|Y|>a]}$, $Z_0=ZI_{[|Z|\le a]}$, $Z_1=ZI_{[|Z|>a]}$. By Lemma 2, $|E[Y_0Z_0]-E[Y_0]E[Z_0]|\le4a^2\alpha_n$. Further,

$$|E[Y_0Z_1]-E[Y_0]E[Z_1]|\le E\bigl[|Y_0-E[Y_0]|\cdot|Z_1-E[Z_1]|\bigr]\le2a\cdot2E[|Z_1|]\le4aE\bigl[|Z_1|\cdot|Z_1/a|^3\bigr]\le4D/a^2.$$

Similarly, $|E[Y_1Z_0]-E[Y_1]E[Z_0]|\le4C/a^2$. Finally,

$$|E[Y_1Z_1]-E[Y_1]E[Z_1]|\le\mathrm{Var}^{1/2}[Y_1]\,\mathrm{Var}^{1/2}[Z_1]\le E^{1/2}[Y_1^2]\,E^{1/2}[Z_1^2]\le E^{1/2}[Y_1^4/a^2]\,E^{1/2}[Z_1^4/a^2]\le C^{1/2}D^{1/2}/a^2.$$

Adding these inequalities gives $4a^2\alpha_n+4(C+D)a^{-2}+C^{1/2}D^{1/2}a^{-2}$ as a bound for the left side of (27.24). Take $a=\alpha_n^{-1/4}$ and observe that $4+4(C+D)+C^{1/2}D^{1/2}\le4+4(C^{1/2}+D^{1/2})^2\le4+8(C+D)$. ■


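PROOF OF THEOREM 27.4. By Lemma 3, $|E[X_1X_{1+n}]|\le8(1+2E[X_1^4])\alpha_n^{1/2}=O(n^{-5/2})$, and so the series in (27.20) converges absolutely. If $\rho_k=E[X_1X_{1+k}]$, then by stationarity $E[S_n^2]=n\rho_0+2\sum_{k=1}^{n-1}(n-k)\rho_k$, and therefore $|\sigma^2-n^{-1}E[S_n^2]|\le2\sum_{k=n}^{\infty}|\rho_k|+2n^{-1}\sum_{i=1}^{n-1}\sum_{k=i}^{\infty}|\rho_k|$; hence (27.20).

By stationarity again,

$$E[S_n^4]\le4!\,n\sum\bigl|E[X_1X_{1+i}X_{1+i+j}X_{1+i+j+k}]\bigr|,$$

where the indices in the sum are constrained by $i,j,k\ge0$ and $i+j+k<n$. By Lemma 3 the summand is at most

$$8\bigl(1+E[X_1^4]+E[X_{1+i}^4X_{1+i+j}^4X_{1+i+j+k}^4]\bigr)\alpha_i^{1/2},$$

which is at most†

$$8\bigl(1+E[X_1^4]+E[X_1^{12}]\bigr)\alpha_i^{1/2}=K_1\alpha_i^{1/2}.$$

Similarly, $K_1\alpha_k^{1/2}$ is a bound. Hence

$$E[S_n^4]\le4!\,n^2\sum_{\substack{i,k\ge0\\i+k<n}}K_1\min\{\alpha_i^{1/2},\alpha_k^{1/2}\}\le K_2n^2\sum_{0\le i\le k}\alpha_k^{1/2}=K_2n^2\sum_{k=0}^{\infty}(k+1)\alpha_k^{1/2}.$$

Since $\alpha_k=O(k^{-5})$, the series here converges, and therefore

(27.25)    $$E[S_n^4]\le Kn^2$$

for some $K$ independent of $n$.

Let $b_n=\lfloor n^{3/4}\rfloor$ and $l_n=\lfloor n^{1/4}\rfloor$. If $r_n$ is the largest integer $i$ such that $(i-1)(b_n+l_n)+b_n<n$, then

(27.26)    $$b_n\sim n^{3/4},\qquad l_n\sim n^{1/4},\qquad r_n\sim n^{1/4}.$$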


Consider the random variables (27.21) and (27.22). By (27.25), (27.26), and 


†$E|XYZ|\le E^{1/3}|X|^3\cdot E^{2/3}|YZ|^{3/2}\le E^{1/3}|X|^3\cdot E^{1/3}|Y|^3\cdot E^{1/3}|Z|^3$.




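stationarity,

$$P\Bigl[\Bigl|\frac{1}{\sigma\sqrt n}\sum_{i=1}^{r_n-1}V_{ni}\Bigr|\ge\epsilon\Bigr]\le\sum_{i=1}^{r_n-1}P\Bigl[|V_{ni}|\ge\frac{\epsilon\sigma\sqrt n}{r_n}\Bigr]\le\frac{r_n^5Kl_n^2}{\epsilon^4\sigma^4n^2}\sim\frac{K}{\epsilon^4\sigma^4n^{1/4}}\to0;$$

(27.25) and (27.26) also give

$$P\Bigl[\frac{1}{\sigma\sqrt n}|V_{nr_n}|\ge\epsilon\Bigr]\le\frac{K(b_n+l_n)^2}{\epsilon^4\sigma^4n^2}\sim\frac{K}{\epsilon^4\sigma^4n^{1/2}}\to0.$$

Therefore, $\sum_{i=1}^{r_n}V_{ni}/\sigma\sqrt n\Rightarrow0$, and by Theorem 25.4 it suffices to prove that $\sum_{i=1}^{r_n}U_{ni}/\sigma\sqrt n\Rightarrow N$.

Let $U'_{ni}$, $1\le i\le r_n$, be independent random variables having the distribution common to the $U_{ni}$. By Lemma 2 extended inductively the characteristic functions of $\sum_{i=1}^{r_n}U_{ni}/\sigma\sqrt n$ and of $\sum_{i=1}^{r_n}U'_{ni}/\sigma\sqrt n$ differ by at most† $16r_n\alpha_{l_n}$. Since $\alpha_n=O(n^{-5})$, this difference is $O(n^{-1})$ by (27.26).

The characteristic function of $\sum_{i=1}^{r_n}U_{ni}/\sigma\sqrt n$ will thus approach $e^{-t^2/2}$ if that of $\sum_{i=1}^{r_n}U'_{ni}/\sigma\sqrt n$ does. It therefore remains only to show that $\sum_{i=1}^{r_n}U'_{ni}/\sigma\sqrt n\Rightarrow N$. Now $E[|U'_{ni}|^2]=E[U_{n1}^2]\sim b_n\sigma^2$ by (27.20). Further, $E[|U'_{ni}|^4]\le Kb_n^2$ by (27.25). Lyapounov's condition (27.16) for $\delta=2$ therefore follows because

$$\frac{1}{(r_nE[U_{n1}^2])^2}\sum_{i=1}^{r_n}E\bigl[|U'_{ni}|^4\bigr]\le\frac{Kr_nb_n^2}{(r_nE[U_{n1}^2])^2}\sim\frac{K}{r_n\sigma^4}\to0.$$ ■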


Example 27.8. Let $\{Y_n\}$ be the stationary Markov process of Example 27.6. Let $f$ be a function on the state space, put $m=\sum_ip_if(i)$, and define $X_n=f(Y_n)-m$. Then $\{X_n\}$ satisfies the conditions of Theorem 27.4. If $\beta_{ij}=\delta_{ij}p_i-p_ip_j+2p_i\sum_{k=1}^{\infty}(p_{ij}^{(k)}-p_j)$, then the $\sigma^2$ in (27.20) is $\sum_{ij}\beta_{ij}(f(i)-m)(f(j)-m)$, and $\sum_{k=1}^{n}f(Y_k)$ is approximately normally distributed with mean $nm$ and standard deviation $\sigma\sqrt n$.

If $f(i)=\delta_{i_0i}$, then $\sum_{k=1}^{n}f(Y_k)$ is the number of passages through the state $i_0$ in the first $n$ steps of the process. In this case $m=p_{i_0}$ and $\sigma^2=p_{i_0}(1-p_{i_0})+2p_{i_0}\sum_{k=1}^{\infty}(p_{i_0i_0}^{(k)}-p_{i_0})$. ■
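For a chain with two states the last formula can be evaluated and compared with a closed form. The Python sketch below assumes a hypothetical chain with transition matrix $\begin{pmatrix}1-a&a\\b&1-b\end{pmatrix}$, $0<a,b<1$, and takes $i_0$ to be the first state; for such a chain $p_{00}^{(k)}-p_0=(1-p_0)\lambda^k$ with $\lambda=1-a-b$, which gives $\sigma^2=p_0(1-p_0)(1+\lambda)/(1-\lambda)$. The names and parameter values are illustrative only.

```python
def two_state_sigma2(a: float, b: float, terms: int = 200) -> float:
    """sigma^2 for the number of visits to state 0, summing the series term by term."""
    p0 = b / (a + b)               # stationary probability of state 0
    p00 = 1.0                      # k-step return probability, starting with k = 0
    series = 0.0
    for _ in range(terms):
        # One more step: stay in 0 with probability 1-a, or return from 1 with probability b.
        p00 = p00 * (1.0 - a) + (1.0 - p00) * b
        series += p00 - p0
    return p0 * (1.0 - p0) + 2.0 * p0 * series


def two_state_closed_form(a: float, b: float) -> float:
    """Closed form p0 (1 - p0) (1 + lam) / (1 - lam) with lam = 1 - a - b."""
    p0 = b / (a + b)
    lam = 1.0 - a - b
    return p0 * (1.0 - p0) * (1.0 + lam) / (1.0 - lam)


if __name__ == "__main__":
    print(two_state_sigma2(0.3, 0.2), two_state_closed_form(0.3, 0.2))
```

With $a=0.3$ and $b=0.2$ both computations give $\sigma^2=0.72$, three times the value $p_0(1-p_0)=0.24$ that independent visits would produce.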


Example 27.9. If the $X_n$ are stationary and $m$-dependent and have mean 0, Theorem 27.4 applies and $\sigma^2=E[X_1^2]+2\sum_{k=1}^{m}E[X_1X_{1+k}]$. Example 27.7 is a case in point. Taking $m=1$ and $f(x,y)=x-y$ in that example gives an instance where $\sigma^2=0$. ■
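The reason $\sigma^2=0$ in this instance is that the sum telescopes: with $X_k=Y_k-Y_{k+1}$, the partial sum $S_n$ equals $Y_1-Y_{n+1}$, whose variance does not grow with $n$. A minimal Python sketch, with a hypothetical list of values standing in for $Y_1,\dots,Y_{n+1}$:

```python
def partial_sum(ys: list[int]) -> int:
    """S_n for X_k = Y_k - Y_{k+1}, k = 1..n, given the values Y_1, ..., Y_{n+1}."""
    return sum(ys[k] - ys[k + 1] for k in range(len(ys) - 1))


if __name__ == "__main__":
    ys = [3, -1, 4, 1, -5, 9]            # arbitrary values, for illustration only
    print(partial_sum(ys), ys[0] - ys[-1])   # the two numbers agree
```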


†The 4 in (27.23) has become 16 to allow for splitting into real and imaginary parts.




PROBLEMS 


27.1. Prove Theorem 23.2 by means of characteristic functions. Hint: Use (27.5) to compare the characteristic function of $\sum_{k=1}^{r_n}Z_{nk}$ with $\exp[\sum_kp_{nk}(e^{it}-1)]$.

27.2. If $\{X_n\}$ is independent and the $X_n$ all have the same distribution with finite first moment, then $n^{-1}S_n\to E[X_1]$ with probability 1 (Theorem 22.1), so that $n^{-1}S_n\Rightarrow E[X_1]$. Prove the latter fact by characteristic functions. Hint: Use (27.5).

27.3. For a Poisson variable $Y_\lambda$ with mean $\lambda$, show that $(Y_\lambda-\lambda)/\sqrt\lambda\Rightarrow N$ as $\lambda\to\infty$. Show that (22.3) fails for $t=1$.

27.4. Suppose that $|X_{nk}|\le M_n$ with probability 1 and $M_n/s_n\to0$. Verify Lyapounov's condition and then Lindeberg's condition.

27.5. Suppose that the random variables in any single row of the triangular array are identically distributed. To what do Lindeberg's and Lyapounov's conditions reduce?

27.6. Suppose that $Z_1,Z_2,\dots$ are independent and identically distributed with mean 0 and variance 1, and suppose that $X_{nk}=\sigma_{nk}Z_k$. Write down the Lindeberg condition and show that it holds if $\max_{k\le r_n}\sigma_{nk}^2=o(\sum_{k=1}^{r_n}\sigma_{nk}^2)$.

27.7. Construct an example where Lindeberg's condition holds but Lyapounov's does not.

27.8. 22.9↑ Prove a central limit theorem for the number $R_n$ of records up to time $n$.

27.9. 6.3↑ Let $S_n$ be the number of inversions in a random permutation on $n$ letters. Prove a central limit theorem for $S_n$.

27.10. The $\delta$-method. Suppose that Theorem 27.1 applies to $\{X_n\}$, so that $\sqrt n\,\sigma^{-1}(\bar X_n-c)\Rightarrow N$, where $\bar X_n=n^{-1}\sum_{k=1}^{n}X_k$. Use Theorem 25.6 as in Example 27.2 to show that, if $f(x)$ has a nonzero derivative at $c$, then $\sqrt n\,(f(\bar X_n)-f(c))/\sigma|f'(c)|\Rightarrow N$: $\bar X_n$ is approximately normal with mean $c$ and standard deviation $\sigma/\sqrt n$, and $f(\bar X_n)$ is approximately normal with mean $f(c)$ and standard deviation $|f'(c)|\sigma/\sqrt n$. Example 27.2 is the case $f(x)=1/x$.

27.11. Suppose independent $X_n$ have density $|x|^{-3}$ outside $(-1,+1)$. Show that $(n\log n)^{-1/2}S_n\Rightarrow N$.

27.12. There can be asymptotic normality even if there are no moments at all. Construct a simple example.

27.13. Let $d_n(\omega)$ be the dyadic digits of a point $\omega$ drawn at random from the unit interval. For a $k$-tuple $(u_1,\dots,u_k)$ of 0's and 1's, let $N_n(u_1,\dots,u_k;\omega)$ be the number of $m\le n$ for which $(d_m(\omega),\dots,d_{m+k-1}(\omega))=(u_1,\dots,u_k)$. Prove a central limit theorem for $N_n(u_1,\dots,u_k;\omega)$. (See Problem 6.12.)




27.14. The central limit theorem for a random number of summands. Let $X_1,X_2,\dots$ be independent, identically distributed random variables with mean 0 and variance $\sigma^2$, and let $S_n=X_1+\cdots+X_n$. For each positive $t$, let $\nu_t$ be a random variable assuming positive integers as values; it need not be independent of the $X_n$. Suppose that there exist positive constants $a_t$ and $\theta$ such that

$$a_t\to\infty,\qquad\frac{\nu_t}{a_t}\Rightarrow\theta$$

as $t\to\infty$. Show by the following steps that

(27.27)    $$\frac{S_{\nu_t}}{\sigma\sqrt{\nu_t}}\Rightarrow N,\qquad\frac{S_{\nu_t}}{\sigma\sqrt{\theta a_t}}\Rightarrow N.$$

(a) Show that it may be assumed that $\theta=1$ and the $a_t$ are integers.
(b) Show that it suffices to prove the second relation in (27.27).
(c) Show that it suffices to prove $(S_{\nu_t}-S_{a_t})/\sqrt{a_t}\Rightarrow0$.
(d) Show that

$$P\bigl[|S_{\nu_t}-S_{a_t}|\ge\epsilon\sqrt{a_t}\bigr]\le P\bigl[|\nu_t-a_t|\ge\epsilon^3a_t\bigr]+P\Bigl[\max_{|k-a_t|\le\epsilon^3a_t}|S_k-S_{a_t}|\ge\epsilon\sqrt{a_t}\Bigr],$$

and conclude from Kolmogorov's inequality that the last probability is at most $2\epsilon\sigma^2$.

27.15. 21.21 23.10 23.14↑ A central limit theorem in renewal theory. Let $X_1,X_2,\dots$ be independent, identically distributed positive random variables with mean $m$ and variance $\sigma^2$, and as in Problem 23.10 let $N_t$ be the maximum $n$ for which $S_n\le t$. Prove by the following steps that

$$\frac{N_t-tm^{-1}}{\sigma t^{1/2}m^{-3/2}}\Rightarrow N.$$

(a) Show by the results in Problems 21.21 and 23.10 that $(S_{N_t}-t)/\sqrt t\Rightarrow0$.
(b) Show that it suffices to prove that

$$\frac{N_t-S_{N_t}m^{-1}}{\sigma t^{1/2}m^{-3/2}}=-\frac{S_{N_t}-mN_t}{\sigma m^{-1/2}t^{1/2}}\Rightarrow N.$$

(c) Show (Problem 23.10) that $N_t/t\Rightarrow m^{-1}$, and apply the theorem in Problem 27.14.


27.16. Show by partial integration that

(27.28)    $$\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-u^2/2}\,du\sim\frac{1}{\sqrt{2\pi}}\,\frac1x\,e^{-x^2/2}$$

as $x\to\infty$.

27.17. ↑ Suppose that $X_1,X_2,\dots$ are independent and identically distributed with mean 0 and variance 1, and suppose that $a_n\to\infty$. Formally combine the central limit theorem and (27.28) to obtain

(27.29)    $$P\bigl[S_n\ge a_n\sqrt n\bigr]\sim\frac{1}{\sqrt{2\pi}}\,\frac{1}{a_n}\,e^{-a_n^2/2}=e^{-a_n^2(1+\zeta_n)/2},$$

where $\zeta_n\to0$ if $a_n\to\infty$. For a case in which this does hold, see Theorem 9.4.

27.18. 21.2↑ Stirling's formula. Let $S_n=X_1+\cdots+X_n$, where the $X_n$ are independent and each has the Poisson distribution with parameter 1. Prove successively:

(a) $E\Bigl[\Bigl(\dfrac{S_n-n}{\sqrt n}\Bigr)^-\Bigr]=e^{-n}\sum_{k=0}^{n}\dfrac{n-k}{\sqrt n}\,\dfrac{n^k}{k!}=\dfrac{n^{n+(1/2)}e^{-n}}{n!}$.
(b) $\Bigl(\dfrac{S_n-n}{\sqrt n}\Bigr)^-\Rightarrow N^-$.
(c) $E\Bigl[\Bigl(\dfrac{S_n-n}{\sqrt n}\Bigr)^-\Bigr]\to E[N^-]=\dfrac{1}{\sqrt{2\pi}}$.
(d) $n!\sim\sqrt{2\pi}\,n^{n+(1/2)}e^{-n}$.

27.19. Let $l_n(\omega)$ be the length of the run of 0's starting at the $n$th place in the dyadic expansion of a point $\omega$ drawn at random from the unit interval; see Example 4.1.

(a) Show that $l_1,l_2,\dots$ is an $\alpha$-mixing sequence, where $\alpha_n=4/2^n$.
(b) Show that $\sum_{k=1}^{n}l_k$ is approximately normally distributed with mean $n$ and variance $6n$.

27.20. Prove under the hypotheses of Theorem 27.4 that $S_n/n\to0$ with probability 1. Hint: Use (27.25).

27.21. 26.1 26.29↑ Let $X_1,X_2,\dots$ be independent and identically distributed, and suppose that the distribution common to the $X_n$ is supported by $[0,2\pi]$ and is not a lattice distribution. Let $S_n=X_1+\cdots+X_n$, where the sum is reduced modulo $2\pi$. Show that $S_n\Rightarrow U$, where $U$ is uniformly distributed over $[0,2\pi]$.


SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS*


Suppose that $Z_\lambda$ has the Poisson distribution with parameter $\lambda$ and that $X_{n1},\dots,X_{nn}$ are independent and $P[X_{nk}=1]=\lambda/n$, $P[X_{nk}=0]=1-\lambda/n$. According to Example 25.2, $X_{n1}+\cdots+X_{nn}\Rightarrow Z_\lambda$. This contrasts with the central limit theorem, in which the limit law is normal. What is the class of all possible limit laws for independent triangular arrays? A suitably restricted form of this question will be answered here.
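The Poisson limit in the opening example can be observed numerically. The Python sketch below computes the total variation distance between the distribution of $X_{n1}+\cdots+X_{nn}$ (binomial with parameters $n$ and $\lambda/n$) and the Poisson distribution with parameter $\lambda$. Total variation is a stronger notion than the convergence in distribution asserted above and is used here only as a convenient yardstick; the function name is an assumption.

```python
import math


def total_variation(n: int, lam: float) -> float:
    """Half the sum over k of |P[Bin(n, lam/n) = k] - P[Poisson(lam) = k]|, for lam < n."""
    p = lam / n
    binom = (1.0 - p) ** n          # P[Bin = 0]
    pois = math.exp(-lam)           # P[Poisson = 0]
    diff = abs(binom - pois)
    pois_mass = pois
    for k in range(1, n + 1):
        # Update both probability mass functions by their one-step recurrences.
        binom *= (n - k + 1) / k * p / (1.0 - p)
        pois *= lam / k
        diff += abs(binom - pois)
        pois_mass += pois
    # The binomial puts no mass beyond n; the Poisson tail there counts in full.
    return 0.5 * (diff + max(1.0 - pois_mass, 0.0))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, total_variation(n, 2.0))
```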


Vague Convergence 


The theory requires two preliminary facts about convergence of measures. Let $\mu_n$ and $\mu$ be finite measures on $(R^1,\mathscr R^1)$. If $\mu_n(a,b]\to\mu(a,b]$ for every finite interval for which $\mu\{a\}=\mu\{b\}=0$, then $\mu_n$ converges vaguely to $\mu$, written $\mu_n\to_v\mu$. If $\mu_n$ and $\mu$ are probability measures, it is not hard to see that this is equivalent to weak convergence: $\mu_n\Rightarrow\mu$. On the other hand, if $\mu_n$ is a unit mass at $n$ and $\mu(R^1)=0$, then $\mu_n\to_v\mu$, but $\mu_n\Rightarrow\mu$ makes no sense, because $\mu$ is not a probability measure.

The first fact needed is this: Suppose that $\mu_n\to_v\mu$ and

(28.1)    $$\sup_n\mu_n(R^1)<\infty;$$

then

(28.2)    $$\int f\,d\mu_n\to\int f\,d\mu$$

for every continuous real $f$ that vanishes at $\pm\infty$ in the sense that $\lim_{|x|\to\infty}f(x)=0$. Indeed, choose $M$ so that $\mu(R^1)<M$ and $\mu_n(R^1)<M$ for all $n$. Given $\epsilon$, choose $a$ and $b$ so that $\mu\{a\}=\mu\{b\}=0$ and $|f(x)|<\epsilon/M$ if $x\notin A=(a,b]$. Then $|\int_{A^c}f\,d\mu_n|<\epsilon$ and $|\int_{A^c}f\,d\mu|<\epsilon$. If $\mu(A)>0$, define $\nu(B)=\mu(B\cap A)/\mu(A)$ and $\nu_n(B)=\mu_n(B\cap A)/\mu_n(A)$. It is easy to see that $\nu_n\Rightarrow\nu$, so that $\int f\,d\nu_n\to\int f\,d\nu$. But then $|\int_Af\,d\mu_n-\int_Af\,d\mu|<\epsilon$ for large $n$, and hence $|\int f\,d\mu_n-\int f\,d\mu|<3\epsilon$ for large $n$. If $\mu(A)=0$, then $\int_Af\,d\mu_n\to0$, and the argument is even simpler.

The other fact needed below is this: If (28.1) holds, then there is a subsequence $\{\mu_{n_k}\}$ and a finite measure $\mu$ such that $\mu_{n_k}\to_v\mu$ as $k\to\infty$. Indeed, let $F_n(x)=\mu_n(-\infty,x]$. Since the $F_n$ are uniformly bounded because of (28.1), the proof of Helly's theorem shows there exists a subsequence $\{F_{n_k}\}$ and a bounded, nondecreasing, right-continuous function $F$ such that $\lim_kF_{n_k}(x)=F(x)$ at continuity points $x$ of $F$. If $\mu$ is the measure for which $\mu(a,b]=F(b)-F(a)$ (Theorem 12.4), then clearly $\mu_{n_k}\to_v\mu$.


The Possible Limits 


Let $X_{n1},\dots,X_{nr_n}$, $n=1,2,\dots$, be a triangular array as in the preceding section. The random variables in each row are independent, the means are 0,


*This section may be omitted.


and the variances are finite:

(28.3)    $$E[X_{nk}]=0,\qquad\sigma_{nk}^2=E[X_{nk}^2],\qquad s_n^2=\sum_{k=1}^{r_n}\sigma_{nk}^2.$$

Assume $s_n^2>0$ and put $S_n=X_{n1}+\cdots+X_{nr_n}$. Here it will be assumed that the total variance is bounded:

(28.4)    $$\sup_ns_n^2<\infty.$$

In order that the $X_{nk}$ be small compared with $S_n$, assume that

(28.5)    $$\lim_n\max_{k\le r_n}\sigma_{nk}^2=0.$$

The arrays in the preceding section were normalized by replacing $X_{nk}$ by $X_{nk}/s_n$. This has the effect of replacing $s_n$ by 1, in which case of course (28.4) holds, and (28.5) is the same thing as $\max_k\sigma_{nk}^2/s_n^2\to0$.

A distribution function $F$ is infinitely divisible if for each $n$ there is a distribution function $F_n$ such that $F$ is the $n$-fold convolution $F_n*\cdots*F_n$ ($n$ copies) of $F_n$. The class of possible limit laws will turn out to consist of the infinitely divisible distributions with mean 0 and finite variance.† It will be possible to exhibit the characteristic functions of these laws in an explicit way.


Theorem 28.1. Suppose that

(28.6)    $$\varphi(t)=\exp\int_{R^1}(e^{itx}-1-itx)\frac{1}{x^2}\,\mu(dx),$$

where $\mu$ is a finite measure. Then $\varphi$ is the characteristic function of an infinitely divisible distribution with mean 0 and variance $\mu(R^1)$.

By (26.4$_2$), the integrand in (28.6) converges to $-t^2/2$ as $x\to0$; take this as its value at $x=0$. By (26.4$_1$), the integrand is at most $t^2/2$ in modulus and so is integrable.

The formula (28.6) is the canonical representation of $\varphi$, and $\mu$ is the canonical measure.


Before proceeding to the proof, consider three examples. 


Example 28.1. If $\mu$ consists of a mass of $\sigma^2$ at the origin, (28.6) is $e^{-\sigma^2t^2/2}$, the characteristic function of a centered normal distribution $F$. It is certainly infinitely divisible: take $F_n$ normal with variance $\sigma^2/n$. ■


†There do exist infinitely divisible distributions without moments (see Problems 28.3 and 28.4), but they do not figure in the theory of this section.




Example 28.2. Suppose that $\mu$ consists of a mass of $\lambda x^2$ at $x\ne0$. Then (28.6) is $\exp\lambda(e^{itx}-1-itx)$; but this is the characteristic function of $x(Z_\lambda-\lambda)$, where $Z_\lambda$ has the Poisson distribution with mean $\lambda$. Thus (28.6) is the characteristic function of a distribution function $F$, and $F$ is infinitely divisible: take $F_n$ to be the distribution function of $x(Z_{\lambda/n}-\lambda/n)$. ■
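The identification in this example can be confirmed numerically. The Python sketch below evaluates $\exp\lambda(e^{itx}-1-itx)$ and, separately, the characteristic function of $x(Z_\lambda-\lambda)$ obtained by summing over the Poisson probabilities; the function names, the truncation at 200 terms, and the test values are assumptions made for the illustration.

```python
import cmath
import math


def canonical_cf(t: float, lam: float, x: float) -> complex:
    """(28.6) when mu is a single point mass lam * x**2 at x != 0."""
    return cmath.exp(lam * (cmath.exp(1j * t * x) - 1.0 - 1j * t * x))


def scaled_poisson_cf(t: float, lam: float, x: float, terms: int = 200) -> complex:
    """E[exp(i t x (Z - lam))] for Z Poisson with mean lam, summed term by term."""
    total = 0j
    pmf = math.exp(-lam)                 # P[Z = 0]
    for k in range(terms):
        total += pmf * cmath.exp(1j * t * x * (k - lam))
        pmf *= lam / (k + 1)             # P[Z = k+1] from P[Z = k]
    return total


if __name__ == "__main__":
    print(canonical_cf(0.7, 3.0, -1.5))
    print(scaled_poisson_cf(0.7, 3.0, -1.5))
```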


Example 28.3. If $\varphi_j(t)$ is given by (28.6) with $\mu_j$ for the measure, and if $\mu=\sum_{j=1}^{k}\mu_j$, then (28.6) is $\varphi_1(t)\cdots\varphi_k(t)$. It follows by the preceding two examples that (28.6) is a characteristic function if $\mu$ consists of finitely many point masses. It is easy to check in the preceding two examples that the distribution corresponding to $\varphi(t)$ has mean 0 and variance $\mu(R^1)$, and since the means and variances add, the same must be true in the present example. ■


PROOF OF THEOREM 28.1. Let $\mu_k$ have mass $\mu(j2^{-k},(j+1)2^{-k}]$ at $j2^{-k}$ for $j=0,\pm1,\dots,\pm2^{2k}$. Then $\mu_k\to_v\mu$. As observed in Example 28.3, if $\varphi_k(t)$ is (28.6) with $\mu_k$ in place of $\mu$, then $\varphi_k$ is a characteristic function. For each $t$ the integrand in (28.6) vanishes at $\pm\infty$; since $\sup_k\mu_k(R^1)<\infty$, $\varphi_k(t)\to\varphi(t)$ follows (see (28.2)). By Corollary 2 to Theorem 26.3, $\varphi(t)$ is itself a characteristic function. Further, the distribution corresponding to $\varphi_k(t)$ has second moment $\mu_k(R^1)$, and since this is bounded, it follows (Theorem 25.11) that the distribution corresponding to $\varphi(t)$ has a finite second moment. Differentiation (use Theorem 16.8) shows that the mean is $\varphi'(0)=0$ and the variance is $-\varphi''(0)=\mu(R^1)$. Thus (28.6) is always the characteristic function of a distribution with mean 0 and variance $\mu(R^1)$.

If $\psi_n(t)$ is (28.6) with $\mu/n$ in place of $\mu$, then $\varphi(t)=\psi_n^n(t)$, so that the distribution corresponding to $\varphi(t)$ is indeed infinitely divisible. ■


The representation (28.6) shows that the normal and Poisson distributions 
are special cases in a very large class of infinitely divisible laws. 


Theorem 28.2. Every infinitely divisible distribution with mean 0 and finite variance is the limit law of $S_n$ for some independent triangular array satisfying (28.3), (28.4), and (28.5).


The proof requires this preliminary result: 


Lemma. If $X$ and $Y$ are independent and $X+Y$ has a second moment, then $X$ and $Y$ have second moments as well.

PROOF. Since $X^2+Y^2\le(X+Y)^2+2|XY|$, it suffices to prove $|XY|$ integrable, and by Fubini's theorem applied to the joint distribution of $X$ and $Y$ it suffices to prove $|X|$ and $|Y|$ individually integrable. Since $|Y|\le|x|+|x+Y|$, $E[|Y|]=\infty$ would imply $E[|x+Y|]=\infty$ for each $x$; by Fubini's theorem again $E[|Y|]=\infty$ would therefore imply $E[|X+Y|]=\infty$, which is impossible. Hence $E[|Y|]<\infty$, and similarly $E[|X|]<\infty$. ■


PROOF OF THEOREM 28.2. Let $F$ be infinitely divisible with mean 0 and variance $\sigma^2$. If $F$ is the $n$-fold convolution of $F_n$, then by the lemma (extended inductively) $F_n$ has finite mean and variance, and these must be 0 and $\sigma^2/n$. Take $r_n=n$ and take $X_{n1},\dots,X_{nn}$ independent, each with distribution function $F_n$. ■


Theorem 28.3. If $F$ is the limit law of $S_n$ for an independent triangular array satisfying (28.3), (28.4), and (28.5), then $F$ has characteristic function of the form (28.6) for some finite measure $\mu$.


PROOF. The proof will yield information making it possible to identify the limit. Let $\varphi_{nk}(t)$ be the characteristic function of $X_{nk}$. The first step is to prove that

(28.7)    $$\prod_{k=1}^{r_n}\varphi_{nk}(t)-\exp\sum_{k=1}^{r_n}(\varphi_{nk}(t)-1)\to0$$

for each $t$. Since $|z|\le1$ implies that $|e^{z-1}|=e^{\mathrm{Re}\,z-1}\le1$, it follows by (27.5) that the difference $\delta_n(t)$ in (28.7) satisfies $|\delta_n(t)|\le\sum_{k=1}^{r_n}|\varphi_{nk}(t)-\exp(\varphi_{nk}(t)-1)|$. Fix $t$. If $\varphi_{nk}(t)-1=\theta_{nk}$, then $|\theta_{nk}|\le t^2\sigma_{nk}^2/2$, and it follows by (28.4) and (28.5) that $\max_k|\theta_{nk}|\to0$ and $\sum_k|\theta_{nk}|=O(1)$. Therefore, for sufficiently large $n$, $|\delta_n(t)|\le\sum_k|1+\theta_{nk}-e^{\theta_{nk}}|\le e^2\sum_k|\theta_{nk}|^2\le e^2\max_k|\theta_{nk}|\cdot\sum_k|\theta_{nk}|$ by (27.15). Hence (28.7).

If $F_{nk}$ is the distribution function of $X_{nk}$, then

$$\sum_{k=1}^{r_n}(\varphi_{nk}(t)-1)=\sum_{k=1}^{r_n}\int_{R^1}(e^{itx}-1)\,dF_{nk}(x)=\sum_{k=1}^{r_n}\int_{R^1}(e^{itx}-1-itx)\,dF_{nk}(x).$$

Let $\mu_n$ be the finite measure satisfying

(28.8)    $$\mu_n(-\infty,x]=\sum_{k=1}^{r_n}\int_{y\le x}y^2\,dF_{nk}(y),$$

and put

(28.9)    $$\varphi_n(t)=\exp\int_{R^1}(e^{itx}-1-itx)\frac{1}{x^2}\,\mu_n(dx).$$

Then (28.7) can be written

(28.10)    $$\prod_{k=1}^{r_n}\varphi_{nk}(t)-\varphi_n(t)\to0.$$

By (28.8), $\mu_n(R^1)=s_n^2$, and this is bounded by assumption. Thus (28.1) holds, and some subsequence $\{\mu_{n_m}\}$ converges vaguely to a finite measure $\mu$. Since the integrand in (28.9) vanishes at $\pm\infty$, $\varphi_{n_m}(t)$ converges to (28.6). But, of course, $\lim_n\varphi_n(t)$ must coincide with the characteristic function of the limit law $F$, which exists by hypothesis. Thus $F$ must have characteristic function of the form (28.6). ■


Theorems 28.1, 28.2, and 28.3 together show that the possible limit laws 
are exactly the infinitely divisible distributions with mean 0 and finite vari- 
ance, and they give explicitly the form the characteristic functions of such 
laws must have. 


Characterizing the Limit 


Theorem 28.4. Suppose that $F$ has characteristic function (28.6) and that an independent triangular array satisfies (28.3), (28.4), and (28.5). Then $S_n$ has limit law $F$ if and only if $\mu_n\to_v\mu$, where $\mu_n$ is defined by (28.8).

PROOF. Since (28.7) holds as before, $S_n$ has limit law $F$ if and only if $\varphi_n(t)$ (defined by (28.9)) converges for each $t$ to $\varphi(t)$ (defined by (28.6)). If $\mu_n\to_v\mu$, then $\varphi_n(t)\to\varphi(t)$ follows because the integrand in (28.9) and (28.6) vanishes at $\pm\infty$ and because (28.1) follows from (28.4).

Now suppose that $\varphi_n(t)\to\varphi(t)$. Since $\mu_n(R^1)=s_n^2$ is bounded, each subsequence $\{\mu_{n_m}\}$ contains a further subsequence $\{\mu_{n_{m(j)}}\}$ converging vaguely to some $\nu$. If it can be shown that $\nu$ necessarily coincides with $\mu$, it will follow by the usual argument that $\mu_n\to_v\mu$. But by the definition (28.9) of $\varphi_n(t)$, it follows that $\varphi(t)$ must coincide with $\psi(t)=\exp\int_{R^1}(e^{itx}-1-itx)x^{-2}\,\nu(dx)$. Now $\varphi'(t)=i\varphi(t)\int_{R^1}(e^{itx}-1)x^{-1}\,\mu(dx)$, and similarly for $\psi'(t)$. Hence $\varphi(t)=\psi(t)$ implies that $\int_{R^1}(e^{itx}-1)x^{-1}\,\nu(dx)=\int_{R^1}(e^{itx}-1)x^{-1}\,\mu(dx)$. A further differentiation gives $\int_{R^1}e^{itx}\,\mu(dx)=\int_{R^1}e^{itx}\,\nu(dx)$. This implies that $\mu(R^1)=\nu(R^1)$, and so $\mu=\nu$ by the uniqueness theorem for characteristic functions. ■


Example 28.4. The normal case. According to the theorem, $S_n\Rightarrow N$ if and only if $\mu_n$ converges vaguely to a unit mass at 0. If $s_n^2=1$, this holds if and only if $\sum_k\int_{|x|\ge\epsilon}x^2\,dF_{nk}(x)\to0$, which is exactly Lindeberg's condition.


Example 28.5. The Poisson case. Let $Z_{n1},\dots,Z_{nr_n}$ be an independent triangular array, and suppose $X_{nk}=Z_{nk}-m_{nk}$ satisfies the conditions of the theorem, where $m_{nk}=E[Z_{nk}]$. If $Z_\lambda$ has the Poisson distribution with parameter $\lambda$, then $\sum_kX_{nk}\Rightarrow Z_\lambda-\lambda$ if and only if $\mu_n$ converges vaguely to a mass of $\lambda$ at 1 (see Example 28.2). If $s_n^2\to\lambda$, the requirement is $\mu_n[1-\epsilon,1+\epsilon]\to\lambda$, or

(28.11)    $$\sum_k\int_{|Z_{nk}-m_{nk}-1|>\epsilon}(Z_{nk}-m_{nk})^2\,dP\to0$$

for positive $\epsilon$. If $s_n^2$ and $\sum_km_{nk}$ both converge to $\lambda$, (28.11) is a necessary and sufficient condition for $\sum_kZ_{nk}\Rightarrow Z_\lambda$. The conditions are easily checked under the hypotheses of Theorem 23.2: $Z_{nk}$ assumes the values 1 and 0 with probabilities $p_{nk}$ and $1-p_{nk}$, $\sum_kp_{nk}\to\lambda$, and $\max_kp_{nk}\to0$. ■


PROBLEMS 


28.1. Show that u„ „u implies u(R') <liminf, w,(R'). Thus in vague conver- 
gence mass can “escape to infinity” but mass cannot “enter from infinity.” 


28.2. (a) Show that u„ >, u if and only if (28.2) holds for every continuous f with 
bounded support. 


(b) Show that if u„ >, p but (28.1) does not hold, then there is a continuous f 
vanishing at +œ for which (28.2) does not hold. 


28.3. 23.77 Suppose that N,Y,,Y,,... are independent, the Y, have a common 
distribution function F, and N has the Poisson distribution with mean a. Then 
S=Y,+ --: +Y, has the compound Poisson distribution. 

(a) Show that the distribution of S is infinitely divisible. Note that S may not 
have a mean. 

(b) The distribution function of S is Ln=0€ “a"F"*(x)/n!, where F”* is the 
n-fold convolution of F (a unit jump at 0 for n = 0). The characteristic function 
of S is expaf? {e'* — 1) dF(x). 

(c) Show that, if F has mean 0 and finite variance, then the canonical measure 
p in (28.6) is specified by w( A) =af,x? dF(x). 


28.4. (a) Let v be a finite measure, and define 


oo n 2 
(28.12) oft) =el ine + | (e*—1 PX = 


TIPA Hisa], 


where the integrand is —/?/2 at the origin. Show that this is the characteristic 
function of an infinitely divisible distribution. 

(b) Show that the Cauchy distribution (see the table on p. 348) is the case 
where y=0 and v has density 7 '(1+x*)~' with respect to Lebesgue 
measure. 


28.5. Show that the Cauchy, exponential, and gamma (see (20.47)) distributions are E 
infinitely divisible. oan 


SECTION 28. INFINITELY DIVISIBLE DISTRIBUTIONS 377 


28.6. Find the canonical representation (28.6) of the exponential distribution with
mean 1:

(a) The characteristic function is ∫_0^∞ e^{itx} e^{−x} dx = (1 − it)^{−1} = φ(t).

(b) Show that (use the principal branch of the logarithm or else operate
formally for the moment) d(log φ(t))/dt = iφ(t) = i∫_0^∞ e^{itx} e^{−x} dx. Integrate with
respect to t to obtain

(28.13)    1/(1 − it) = exp ∫_0^∞ (e^{itx} − 1) (e^{−x}/x) dx.

Verify (28.13) after the fact by showing that the ratio of the two sides has
derivative 0.

(c) Multiply (28.13) by e^{−it} to center the exponential distribution at its mean:
The canonical measure μ has density xe^{−x} over (0, ∞).


28.7. ↑ If X and Y are independent and each has the exponential density e^{−x},
then X − Y has the double exponential density ½e^{−|x|} (see the table on p. 348).
Show that its characteristic function is

    1/(1 + t²) = exp ∫_{−∞}^{∞} (e^{itx} − 1 − itx) (1/x²) |x| e^{−|x|} dx.


28.8. ↑ Suppose X_1, X_2, ... are independent and each has the double exponential
density. Show that Σ_{n=1}^∞ X_n/n converges with probability 1. Show that the
distribution of the sum is infinitely divisible and that its canonical measure has
density |x|e^{−|x|}/(1 − e^{−|x|}) = Σ_{n=1}^∞ |x|e^{−n|x|}.


28.9. 26.8 ↑ Show that for the gamma density e^{−x}x^{u−1}/Γ(u) the canonical measure
has density uxe^{−x} over (0, ∞).


The remaining problems require the notion of a stable law. A distribution function
F is stable if for each n there exist constants a_n and b_n, a_n > 0, such that, if
X_1,...,X_n are independent and have distribution function F, then
a_n^{−1}(X_1 + ··· + X_n) + b_n also has distribution function F.


28.10. Suppose that for all a, a′, b, b′ there exist a″, b″ (here a, a′, a″ are all positive)
such that F(ax + b) * F(a′x + b′) = F(a″x + b″). Show that F is stable.


28.11. Show that a stable law is infinitely divisible.


28.12. Show that the Poisson law, although infinitely divisible, is not stable.


28.13. Show that the normal and Cauchy laws are stable.


28.14. 28.10 ↑ Suppose that F has mean 0 and variance 1 and that the dependence
of a″, b″ on a, a′, b, b′ is such that

    [displayed equation illegible in the source]

Show that F is the standard normal distribution.




28.15. (a) Let Y_{nk} be independent random variables having the Poisson distribution
with mean cn^α/|k|^{1+α}, where c > 0 and 0 < α < 2. Let Z_n = n^{−1} Σ_{k=−n²}^{n²} kY_{nk}
(omit k = 0 in the sum), and show that if c is properly chosen then the
characteristic function of Z_n converges to e^{−|t|^α}.
(b) Show for 0 < α ≤ 2 that e^{−|t|^α} is the characteristic function of a symmetric
stable distribution; it is called the symmetric stable law of exponent α. The case
α = 2 is the normal law, and α = 1 is the Cauchy law.


SECTION 29. LIMIT THEOREMS IN R^k


If F_n and F are distribution functions on R^k, then F_n converges weakly to F,
written F_n ⇒ F, if lim_n F_n(x) = F(x) for all continuity points x of F. The
corresponding distributions μ_n and μ are in this case also said to converge
weakly: μ_n ⇒ μ. If X_n and X are k-dimensional random vectors (possibly on
different probability spaces), X_n converges in distribution to X, written
X_n ⇒ X, if the corresponding distribution functions converge weakly. The
definitions are thus exactly as for the line.


The Basic Theorems 


The closure A^− of a set in R^k is the set of limits of sequences in A; the
interior is A° = R^k − (R^k − A)^−; and the boundary is ∂A = A^− − A°. A Borel
set A is a μ-continuity set if μ(∂A) = 0. The first theorem is the k-dimensional
version of Theorem 25.8.


Theorem 29.1. For probability measures μ_n and μ on (R^k, ℛ^k), each of
the following conditions is equivalent to the weak convergence of μ_n to μ:


(i) lim_n ∫f dμ_n = ∫f dμ for bounded continuous f;
(ii) lim sup_n μ_n(C) ≤ μ(C) for closed C;
(iii) lim inf_n μ_n(G) ≥ μ(G) for open G;
(iv) lim_n μ_n(A) = μ(A) for μ-continuity sets A.


PROOF. It will first be shown that (i) through (iv) are all equivalent.
(i) implies (ii): Consider the distance dist(x, C) = inf[|x − y|: y ∈ C] from x
to C. It is continuous in x. Let

             1         if t ≤ 0,
    φ_j(t) = 1 − jt    if 0 ≤ t ≤ j^{−1},
             0         if j^{−1} ≤ t.

Then f_j(x) = φ_j(dist(x, C)) is continuous and bounded by 1, and f_j(x) ↓ I_C(x)
as j ↑ ∞ because C is closed. If (i) holds, then lim sup_n μ_n(C) ≤ lim_n ∫f_j dμ_n
= ∫f_j dμ. As j ↑ ∞, ∫f_j dμ ↓ ∫I_C dμ = μ(C).
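
The approximation used in this step can be illustrated numerically. The following is a minimal sketch, assuming k = 1 and taking the closed set to be the interval C = [0, 1]; the function names and the sample points are hypothetical choices made only for this illustration. It shows the continuous functions f_j decreasing to the indicator of C as j grows.

```python
import numpy as np

def phi(j, t):
    # phi_j(t) = 1 for t <= 0, 1 - j*t for 0 <= t <= 1/j, 0 for t >= 1/j
    return np.clip(1.0 - j * t, 0.0, 1.0)

def dist_to_interval(x, lo, hi):
    # distance from x to the closed interval C = [lo, hi] (illustrative choice of C)
    return np.maximum(np.maximum(lo - x, x - hi), 0.0)

def f(j, x, lo=0.0, hi=1.0):
    # f_j(x) = phi_j(dist(x, C)): continuous, bounded by 1, equal to 1 on C
    return phi(j, dist_to_interval(x, lo, hi))

xs = np.array([-0.5, -0.01, 0.0, 0.5, 1.0, 1.01, 2.0])
for j in (1, 10, 1000):
    # as j increases, the values off C shrink to 0 while the values on C stay 1
    print(j, f(j, xs))
```

Because C is closed, every point outside C has positive distance to C, which is why the values off C eventually reach 0.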

(ii) is equivalent to (iii). Take C = R^k − G.

(ii) and (iii) imply (iv): From (ii) and (iii) follows

    μ(A°) ≤ lim inf_n μ_n(A°) ≤ lim inf_n μ_n(A)
          ≤ lim sup_n μ_n(A) ≤ lim sup_n μ_n(A^−) ≤ μ(A^−).

Since μ(∂A) = 0 forces μ(A°) = μ(A^−) = μ(A), (iv) follows from this.

(iv) implies (i): Suppose that f is continuous and |f(x)| is bounded by K.
Given ε, choose reals a_0 < a_1 < ··· < a_l so that a_0 < −K < K < a_l, a_i −
a_{i−1} < ε, and μ[x: f(x) = a_i] = 0. The last condition can be achieved
because the sets [x: f(x) = a] are disjoint for different a. Put A_i = [x:
a_{i−1} < f(x) ≤ a_i]. Since f is continuous, A_i^− ⊂ [x: a_{i−1} ≤ f(x) ≤ a_i] and
A_i° ⊃ [x: a_{i−1} < f(x) < a_i]. Therefore, ∂A_i ⊂ [x: f(x) = a_{i−1}] ∪ [x: f(x) = a_i],
and therefore μ(∂A_i) = 0. Now |∫f dμ_n − Σ_{i=1}^l a_i μ_n(A_i)| ≤ ε and similarly for
μ, and Σ_{i=1}^l a_i μ_n(A_i) → Σ_{i=1}^l a_i μ(A_i) because of (iv). Since ε was arbitrary,
(i) follows.

It remains to prove these four conditions equivalent to weak convergence. 

(iv) implies μ_n ⇒ μ: Consider the corresponding distribution functions. If
S_x = [y: y_i ≤ x_i, i = 1,...,k], then F is continuous at x if and only if
μ(∂S_x) = 0; see the argument following (20.18). Therefore, if F is continuous
at x, F_n(x) = μ_n(S_x) → μ(S_x) = F(x), and F_n ⇒ F.

μ_n ⇒ μ implies (iii): Since only countably many parallel hyperplanes can
have positive μ-measure, there is a dense set D of reals such that μ[x:
x_i = d] = 0 for d ∈ D and i = 1,...,k. Let 𝒜 be the class of rectangles
A = [x: a_i < x_i ≤ b_i, i = 1,...,k] for which the a_i and the b_i all lie in D. All
2^k vertices of such a rectangle are continuity points of F, and so F_n ⇒ F
implies (see (12.12)) that μ_n(A) = Δ_A F_n → Δ_A F = μ(A). It follows by the
inclusion-exclusion formula that μ_n(B) → μ(B) for finite unions B of elements
of 𝒜. Since D is dense on the line, an open set G in R^k is a
countable union of sets A_m in 𝒜. But μ(∪_{m≤M} A_m) = lim_n μ_n(∪_{m≤M} A_m)
≤ lim inf_n μ_n(G). Letting M → ∞ gives (iii). ∎


Theorem 29.2. Suppose that h: R^k → R^j is measurable and that the set D_h
of its discontinuities is measurable.† If μ_n ⇒ μ in R^k and μ(D_h) = 0, then
μ_n h^{−1} ⇒ μh^{−1} in R^j.


†The argument in the footnote on p. 334 shows that in fact D_h ∈ ℛ^k always holds.




PROOF. Let C be a closed set in R^j. The closure (h^{−1}C)^− in R^k satisfies
(h^{−1}C)^− ⊂ D_h ∪ h^{−1}C. If μ_n ⇒ μ, then part (ii) of Theorem 29.1 gives

    lim sup_n μ_n h^{−1}(C) ≤ lim sup_n μ_n((h^{−1}C)^−) ≤ μ((h^{−1}C)^−)
                            ≤ μ(D_h) + μ(h^{−1}C).

Using (ii) again gives μ_n h^{−1} ⇒ μh^{−1} if μ(D_h) = 0. ∎


Theorem 29.2 is the k-dimensional version of the mapping theorem, Theorem
25.7. The two proofs just given provide in the case k = 1 a second
approach to the theory of Section 25, which there was based on Skorohod's
theorem (Theorem 25.6). Skorohod's theorem does extend to R^k, but the
proof is harder.†

Theorems 29.1 and 29.2 can of course be stated in terms of random
vectors. For example, X_n ⇒ X if and only if P[X ∈ G] ≤ lim inf_n P[X_n ∈ G]
for all open sets G.

A sequence {μ_n} of probability measures on (R^k, ℛ^k) is tight if for every ε
there is a bounded rectangle A such that μ_n(A) > 1 − ε for all n.


Theorem 29.3. If {μ_n} is a tight sequence of probability measures, there is a
subsequence {μ_{n_i}} and a probability measure μ such that μ_{n_i} ⇒ μ as i → ∞.


PROOF. Take S_x = [y: y_j ≤ x_j, j ≤ k] and F_n(x) = μ_n(S_x). The proof of
Helly's theorem (Theorem 25.9) carries over: For points x and y in R^k,
interpret x ≤ y as meaning x_u ≤ y_u, u = 1,...,k, and x < y as meaning
x_u < y_u, u = 1,...,k. Consider rational points r (points whose coordinates
are all rational), and by the diagonal method [A14] choose a sequence {n_i}
along which lim_i F_{n_i}(r) = G(r) exists for each such r. As before, define
F(x) = inf[G(r): x < r]. Although F is clearly nondecreasing in each variable,
a further argument is required to prove Δ_A F ≥ 0 (see (12.12)).

Given ε and a rectangle A = (a_1, b_1] × ··· × (a_k, b_k], choose a δ such
that if z = (δ,...,δ), then for each of the 2^k vertices x of A, x < r < x + z
implies |F(x) − G(r)| < ε/2^k. Now choose rational points r and s such that
a < r < a + z and b < s < b + z. If B = (r_1, s_1] × ··· × (r_k, s_k], then
|Δ_A F − Δ_B G| < ε. Since Δ_B G = lim_i Δ_B F_{n_i} ≥ 0 and ε is arbitrary, it
follows that Δ_A F ≥ 0.

With the present interpretation of the symbols, the proof of Theorem 25.9
shows that F is continuous from above and lim_i F_{n_i}(x) = F(x) for continuity
points x of F.


†The approach of this section carries over to general metric spaces; for this theory and its
applications, see BILLINGSLEY, and BILLINGSLEY. Since Skorohod's theorem is no easier in
R^k than in the general metric space, it is not treated here.




By Theorem 12.5, there is a measure μ on (R^k, ℛ^k) such that μ(A) = Δ_A F
for rectangles A. By tightness, there is for given ε a t such that μ_n[y:
−t < y_j ≤ t, j ≤ k] > 1 − ε for all n. Suppose that all coordinates of x exceed
t: If r > x, then F_{n_i}(r) > 1 − ε and hence (r rational) G(r) ≥ 1 − ε, so that
F(x) ≥ 1 − ε. Suppose, on the other hand, that some coordinate of x is less
than −t: Choose a rational r such that x < r and some coordinate of r is
less than −t; then F_{n_i}(r) < ε, hence G(r) ≤ ε, and so F(x) ≤ ε. Therefore, for
every ε there is a t such that

(29.1)    F(x) ≥ 1 − ε    if x_j > t for all j,
          F(x) ≤ ε        if x_j < −t for some j.

If B_s = [y: −s < y_j ≤ x_j, j ≤ k], then μ(S_x) = lim_s μ(B_s) = lim_s Δ_{B_s} F. Of
the 2^k terms in the sum Δ_{B_s} F, all but F(x) go to 0 (s → ∞) because of the
second part of (29.1). Thus μ(S_x) = F(x).† Because of the other part of
(29.1), μ is a probability measure. Therefore, F_{n_i} ⇒ F and μ_{n_i} ⇒ μ. ∎


Obviously Theorem 29.3 implies that tightness is a sufficient condition that
each subsequence of {μ_n} contain a further subsequence converging weakly
to some probability measure. (An easy modification of the proof of Theorem
25.10 shows that tightness is necessary for this as well.) And clearly the
corollary to Theorem 25.10 now goes through:


Corollary. If {μ_n} is a tight sequence of probability measures, and if each
subsequence that converges weakly at all converges weakly to the probability
measure μ, then μ_n ⇒ μ.


Characteristic Functions 


Consider a random vector X = (X_1,...,X_k) and its distribution μ in R^k. Let
t·x = Σ_{u=1}^k t_u x_u denote inner product. The characteristic function of X and
of μ is defined over R^k by


(29.2)    φ(t) = ∫_{R^k} e^{it·x} μ(dx) = E[e^{it·X}].


To a great extent its properties parallel those of the one-dimensional charac- 
teristic function and can be deduced by parallel arguments. 


†This requires proof because there exist (Problem 12.10) functions F′ other than F for which
μ(A) = Δ_A F′ holds for all rectangles A.




The inversion formula (26.16) takes this form: For a bounded rectangle
A = [x: a_u < x_u ≤ b_u, u ≤ k] such that μ(∂A) = 0,

(29.3)    μ(A) = lim_{T→∞} (2π)^{−k} ∫_{B_T} ∏_{u=1}^k [(e^{−it_u a_u} − e^{−it_u b_u})/(it_u)] φ(t) dt,

where B_T = [t ∈ R^k: |t_u| ≤ T, u ≤ k] and dt is short for dt_1 ··· dt_k. To prove
it, replace φ(t) by the middle term in (29.2) and reverse the integrals as in
(26.17): The integral in (29.3) is

    I_T = (2π)^{−k} ∫_{R^k} [ ∫_{B_T} ∏_{u=1}^k ((e^{−it_u a_u} − e^{−it_u b_u})/(it_u)) e^{it·x} dt ] μ(dx).

The inner integral may be evaluated by Fubini's theorem in R^k, which gives

    I_T = ∫_{R^k} ∏_{u=1}^k [ (sgn(x_u − a_u)/π) S(T·|x_u − a_u|)
                              − (sgn(x_u − b_u)/π) S(T·|x_u − b_u|) ] μ(dx).

Since the integrand converges to ∏_{u=1}^k ψ_{a_u,b_u}(x_u) (see (26.18)), (29.3) follows
as in the case k = 1.

The proof that weak convergence implies (iii) in Theorem 29.1 shows that
for probability measures μ and ν on R^k there exists a dense set D of reals
such that μ(∂A) = ν(∂A) = 0 for all rectangles A whose vertices have coordinates
in D. If μ(A) = ν(A) for such rectangles, then μ and ν are identical by
Theorem 3.3.

Thus the characteristic function φ uniquely determines the probability measure
μ. Further properties of the characteristic function can be derived from
the one-dimensional case by means of the following device of Cramér and
Wold. For t ∈ R^k, define h_t: R^k → R^1 by h_t(x) = t·x. For real α, [x:
t·x ≤ α] is a half space, and its μ-measure is


(29.4)    μ[x: t·x ≤ α] = μh_t^{−1}(−∞, α].


By change of variable, the characteristic function of μh_t^{−1} is


(29.5)    ∫_{R^1} e^{isy} μh_t^{−1}(dy) = ∫_{R^k} e^{is(t·x)} μ(dx)
                                        = φ(st_1,...,st_k),    s ∈ R^1.


To know the μ-measure of every half space is (by (29.4)) to know each μh_t^{−1}
and hence is (by (29.5) for s = 1) to know φ(t) for every t; and to know the
characteristic function φ of μ is to know μ. Thus μ is uniquely determined by
the values it gives to the half spaces. This result, very simple in its statement,
seems to require Fourier methods; no elementary proof is known.

If μ_n ⇒ μ for probability measures on R^k, then φ_n(t) → φ(t) for the
corresponding characteristic functions by Theorem 29.1. But suppose that the
characteristic functions converge pointwise. It follows by (29.5) that for each
t the characteristic function of μ_n h_t^{−1} converges pointwise on the line to the
characteristic function of μh_t^{−1}; by the continuity theorem for characteristic
functions on the line then, μ_n h_t^{−1} ⇒ μh_t^{−1}. Take the uth component of t to
be 1 and the others 0; then the μ_n h_t^{−1} are the marginals for the uth
coordinate. Since {μ_n h_t^{−1}} is weakly convergent, there is a bounded interval
(a_u, b_u] such that μ_n[x ∈ R^k: a_u < x_u ≤ b_u] = μ_n h_t^{−1}(a_u, b_u] > 1 − ε/k for all
n. But then μ_n(A) > 1 − ε for the bounded rectangle A = [x: a_u < x_u ≤ b_u,
u = 1,...,k]. The sequence {μ_n} is therefore tight. If a subsequence {μ_{n_i}}
converges weakly to ν, then φ_{n_i}(t) converges to the characteristic function of
ν, which is therefore φ(t). By uniqueness, ν = μ, so that μ_{n_i} ⇒ μ. By the
corollary to Theorem 29.3, μ_n ⇒ μ. This proves the continuity theorem for
k-dimensional characteristic functions: μ_n ⇒ μ if and only if φ_n(t) → φ(t) for
all t.

The Cramér—Wold idea leads also to the following result, by means of 
which certain limit theorems can be reduced in a routine way to the 
one-dimensional case. 


Theorem 29.4. For random vectors X_n = (X_{n1},...,X_{nk}) and Y =
(Y_1,...,Y_k), a necessary and sufficient condition for X_n ⇒ Y is that
Σ_{u=1}^k t_u X_{nu} ⇒ Σ_{u=1}^k t_u Y_u for each (t_1,...,t_k) in R^k.


PROOF. The necessity follows from a consideration of the continuous
mapping h_t above; use Theorem 29.2. As for sufficiency, the condition
implies by the continuity theorem for one-dimensional characteristic functions
that for each (t_1,...,t_k)


    E[e^{is Σ_{u=1}^k t_u X_{nu}}] → E[e^{is Σ_{u=1}^k t_u Y_u}]


for all real s. Taking s = 1 shows that the characteristic function of X_n
converges pointwise to that of Y. ∎


Normal Distributions in R^k


By Theorem 20.4 there is (on some probability space) a random vector
X = (X_1,...,X_k) with independent components each having the standard
normal distribution. Since each X_u has density e^{−x²/2}/√(2π), X has density
(see (20.25))


(29.6)    f(x) = (2π)^{−k/2} e^{−|x|²/2},


where |x|² = Σ_{u=1}^k x_u² denotes Euclidean norm. This distribution plays the
role of the standard normal distribution in R^k. Its characteristic function is


(29.7)    E[∏_{u=1}^k e^{it_u X_u}] = ∏_{u=1}^k e^{−t_u²/2} = e^{−|t|²/2}.

Let A = [a_{uv}] be a k × k matrix, and put Y = AX, where X is viewed as a
column vector. Since E[X_α X_β] = δ_{αβ}, the matrix Σ = [σ_{uv}] of the covariances
of Y has entries σ_{uv} = E[Y_u Y_v] = Σ_{α=1}^k a_{uα} a_{vα}. Thus Σ = AA′, where the
prime denotes transpose. The matrix Σ is symmetric and nonnegative definite:
Σ_{u,v} σ_{uv} x_u x_v = |A′x|² ≥ 0. View t also as a column vector with transpose
t′, and note that t·x = t′x. The characteristic function of AX is thus

(29.8)    E[e^{it′(AX)}] = E[e^{i(A′t)′X}] = e^{−|A′t|²/2} = e^{−t′Σt/2}.


Define a centered normal distribution as any probability measure whose
characteristic function has this form for some symmetric nonnegative
definite Σ.

If Σ is symmetric and nonnegative definite, then for an appropriate
orthogonal matrix U, U′ΣU = D is a diagonal matrix whose diagonal elements
are the eigenvalues of Σ and hence are nonnegative. If D_0 is the
diagonal matrix whose elements are the square roots of those of D, and if
A = UD_0, then Σ = AA′. Thus for every nonnegative definite Σ there exists a
centered normal distribution (namely the distribution of AX) with covariance
matrix Σ and characteristic function exp(−½t′Σt).
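
The factorization just described can be sketched numerically. The following is a minimal illustration using NumPy's symmetric eigendecomposition; the function name `normal_factor` and the two sample matrices are hypothetical choices, not part of the text. It builds A = UD_0 from a symmetric nonnegative definite matrix and checks that AA′ recovers it, including in a singular case.

```python
import numpy as np

def normal_factor(sigma):
    # eigh returns eigenvalues d and an orthogonal U with sigma = U diag(d) U'
    d, U = np.linalg.eigh(sigma)
    # clip tiny negative round-off so the square roots are well defined
    d = np.clip(d, 0.0, None)
    # A = U D_0, where D_0 holds the square roots of the eigenvalues
    return U @ np.diag(np.sqrt(d))

sigma_nonsingular = np.array([[2.0, 1.0],
                              [1.0, 2.0]])   # eigenvalues 1 and 3
sigma_singular = np.array([[1.0, 1.0],
                           [1.0, 1.0]])      # eigenvalues 0 and 2

A1 = normal_factor(sigma_nonsingular)
A2 = normal_factor(sigma_singular)
print(np.allclose(A1 @ A1.T, sigma_nonsingular))
print(np.allclose(A2 @ A2.T, sigma_singular))
```

If X is a standard normal vector, `A1 @ X` then has covariance matrix `sigma_nonsingular`; a Cholesky factor would serve equally well when the matrix is nonsingular, but the eigenvalue route also covers the singular case.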

If Σ is nonsingular, so is the A just constructed. Since X has density
(29.6), Y = AX has, by the Jacobian transformation formula (20.20), density
f(A^{−1}x)|det A^{−1}|. From Σ = AA′ follows |det A^{−1}| = (det Σ)^{−1/2}. Moreover,
Σ^{−1} = (A′)^{−1}A^{−1}, so that |A^{−1}x|² = x′Σ^{−1}x. Thus the normal distribution
has density (2π)^{−k/2}(det Σ)^{−1/2} exp(−½x′Σ^{−1}x) if Σ is nonsingular. If Σ is
singular, the A constructed above must be singular as well, so that AX is
confined to some hyperplane of dimension k − 1 and the distribution can
have no density.

By (29.8) and the uniqueness theorem for characteristic functions in R^k, a
centered normal distribution is completely determined by its covariance matrix.
Suppose the off-diagonal elements of Σ are 0, and let A be the diagonal
matrix with the σ_{ii}^{1/2} along the diagonal. Then Σ = AA′, and if X has the
standard normal distribution, the components X_i are independent and hence
so are the components σ_{ii}^{1/2}X_i of AX. Therefore, the components of a
normally distributed random vector are independent if and only if they are
uncorrelated.

If M is a j × k matrix and Y has in R^k the centered normal distribution
with covariance matrix Σ, then MY has in R^j the characteristic function
exp(−½(M′t)′Σ(M′t)) = exp(−½t′(MΣM′)t) (t ∈ R^j). Hence MY has the
centered normal distribution in R^j with covariance matrix MΣM′. Thus a
linear transformation of a normal distribution is itself normal.
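
The identity behind this paragraph can be checked on a small example. The sketch below, in which the matrices `sigma` and `M` and the vector `t` are arbitrary illustrative values, writes Y = AX with AA′ = Σ, so that MY = (MA)X has covariance (MA)(MA)′, and compares that with MΣM′; it also evaluates the two forms of the exponent in the characteristic function.

```python
import numpy as np

sigma = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])     # symmetric, positive definite (k = 3)
M = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])        # a j x k matrix with j = 2

# factor sigma = A A' via the eigendecomposition A = U D_0
d, U = np.linalg.eigh(sigma)
A = U @ np.diag(np.sqrt(np.clip(d, 0.0, None)))

# MY = (MA)X for standard normal X, so its covariance is (MA)(MA)'
cov_MY = (M @ A) @ (M @ A).T

# the two expressions for the characteristic function of MY at a point t in R^j
t = np.array([0.3, -1.2])
cf_left = np.exp(-0.5 * (M.T @ t) @ sigma @ (M.T @ t))
cf_right = np.exp(-0.5 * t @ (M @ sigma @ M.T) @ t)

print(np.allclose(cov_MY, M @ sigma @ M.T), np.isclose(cf_left, cf_right))
```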

These normal distributions are special in that all the first moments vanish. 
The general normal distribution is a translation of one of these centered 
distributions. It is completely determined by its means and covariances. 


The Central Limit Theorem 


Let X_n = (X_{n1},...,X_{nk}) be independent random vectors all having the same
distribution. Suppose that E[X_{nu}²] < ∞; let the vector of means be c =
(c_1,...,c_k), where c_u = E[X_{nu}], and let the covariance matrix be Σ = [σ_{uv}],
where σ_{uv} = E[(X_{nu} − c_u)(X_{nv} − c_v)]. Put S_n = X_1 + ··· + X_n.


Theorem 29.5. Under these assumptions, the distribution of the random
vector (S_n − nc)/√n converges weakly to the centered normal distribution with
covariance matrix Σ.


PROOF. Let Y = (Y_1,...,Y_k) be a normally distributed random vector
with 0 means and covariance matrix Σ. For given t = (t_1,...,t_k), let Z_n =
Σ_{u=1}^k t_u(X_{nu} − c_u) and Z = Σ_{u=1}^k t_u Y_u. By Theorem 29.4, it suffices to prove
that n^{−1/2} Σ_{j=1}^n Z_j ⇒ Z (for arbitrary t). But this is an instant consequence of
the Lindeberg-Lévy theorem (Theorem 27.1). ∎
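
A small simulation can make the statement concrete. The sketch below is only an illustration under assumed inputs: the summands are taken to be X = (U, U + V) with U and V independent exponential variables of mean 1, so that c = (1, 2) and Σ = [[1, 1], [1, 2]]; the sample sizes, the seed, and the direction t are arbitrary. It compares the sample covariance of (S_n − nc)/√n with Σ and, following the Cramér-Wold reduction, the variance of one linear combination with t′Σt.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 20000, 50                      # number of replications, summands per sum

# assumed i.i.d. vectors X = (U, U + V), U and V independent exponential(1)
U = rng.exponential(size=(reps, n))
V = rng.exponential(size=(reps, n))
X = np.stack([U, U + V], axis=-1)        # shape (reps, n, 2)

c = np.array([1.0, 2.0])                 # vector of means
Sigma = np.array([[1.0, 1.0],
                  [1.0, 2.0]])           # covariance matrix of X

# reps independent draws of the normalized sum (S_n - n c)/sqrt(n)
W = (X.sum(axis=1) - n * c) / np.sqrt(n)
sample_cov = np.cov(W, rowvar=False)

# one-dimensional projection t.W, whose limiting variance is t' Sigma t
t = np.array([0.7, -0.4])
proj = W @ t

print(sample_cov)
print(proj.var(), t @ Sigma @ t)
```

Agreement of covariances is of course weaker than weak convergence; a histogram of `proj` against the normal density with variance t′Σt would be the natural next check.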


PROBLEMS 


29.1. A real function f on R^k is everywhere upper semicontinuous (see Problem
13.8) if for each x and ε there is a δ such that |x − y| < δ implies that
f(y) < f(x) + ε; f is lower semicontinuous if −f is upper semicontinuous.
(a) Use condition (iii) of Theorem 29.1, Fatou's lemma, and (21.9) to show
that, if μ_n ⇒ μ and f is bounded and lower semicontinuous, then


(29.9)    lim inf_n ∫f dμ_n ≥ ∫f dμ.


(b) Show that, if (29.9) holds for all bounded, lower semicontinuous functions
f, then μ_n ⇒ μ.
(c) Prove the analogous results for upper semicontinuous functions.


29.2. (a) Show for probability measures on the line that μ_n × ν_n ⇒ μ × ν if and only
if μ_n ⇒ μ and ν_n ⇒ ν.

(b) Suppose that X_n and Y_n are independent and that X and Y are independent.
Show that, if X_n ⇒ X and Y_n ⇒ Y, then (X_n, Y_n) ⇒ (X, Y) and hence that
X_n + Y_n ⇒ X + Y.

(c) Show that part (b) fails without independence.

(d) If F_n ⇒ F and G_n ⇒ G, then F_n * G_n ⇒ F * G. Prove this by part (b) and
also by characteristic functions.


29.3. (a) Show that {μ_n} is tight if and only if for each ε there is a compact set K
such that μ_n(K) > 1 − ε for all n.

(b) Show that {μ_n} is tight if and only if each of the k sequences of marginal
distributions is tight on the line.


29.4. Assume of (X_n, Y_n) that X_n ⇒ X and Y_n ⇒ c. Show that (X_n, Y_n) ⇒ (X, c). This
is an example of Problem 29.2(b) where X_n and Y_n need not be assumed
independent.


29.5. Prove analogues for R^k of the corollaries to Theorem 26.3.


29.6. Suppose that f(X) and g(Y) are uncorrelated for all bounded continuous f
and g. Show that X and Y are independent. Hint: Use characteristic functions.


29.7. 20.16 ↑ Suppose that the random vector X has a centered k-dimensional
normal distribution whose covariance matrix has 1 as an eigenvalue of multiplicity
r and 0 as an eigenvalue of multiplicity k − r. Show that |X|² has the
chi-squared distribution with r degrees of freedom.


29.8. ↑ Multinomial sampling. Let p_1,...,p_k be positive and add to 1, and let
Z_1, Z_2, ... be independent k-dimensional random vectors such that Z_n has
with probability p_i a 1 in the ith component and 0's elsewhere. Then f_n =
(f_{n1},...,f_{nk}) = Σ_{m=1}^n Z_m is the frequency count for a sample of size n from a
multinomial population with cell probabilities p_i. Put X_{ni} = (f_{ni} − np_i)/√(np_i)
and X_n = (X_{n1},...,X_{nk}).

(a) Show that X_n has mean values 0 and covariances σ_{ij} = (δ_{ij}p_j −
p_i p_j)/√(p_i p_j).

(b) Show that the chi-squared statistic Σ_{i=1}^k (f_{ni} − np_i)²/np_i has asymptotically
the chi-squared distribution with k − 1 degrees of freedom.


29.9. 20.26 ↑ A theorem of Poincaré. (a) Suppose that X_n = (X_{n1},...,X_{nn}) is
uniformly distributed over the surface of a sphere of radius √n in R^n. Fix t,
and show that X_{n1},...,X_{nt} are in the limit independent, each with the
standard normal distribution. Hint: If the components of Y_n = (Y_{n1},...,Y_{nn})
are independent, each with the standard normal distribution, then X_n has the
same distribution as √n Y_n/|Y_n|.

(b) Suppose that the distribution of X_n = (X_{n1},...,X_{nn}) is spherically symmetric
in the sense that X_n/|X_n| is uniformly distributed over the unit sphere.
Assume that |X_n|²/n ⇒ 1, and show that X_{n1},...,X_{nt} are asymptotically
independent and normal.


29.10. Let X_n = (X_{n1},...,X_{nk}), n = 1, 2, ..., be random vectors satisfying the mixing
condition (27.19) with α_n = O(n^{−5}). Suppose that the sequence is stationary
(the distribution of (X_n,...,X_{n+j}) is the same for all n), that E[X_{nu}] = 0, and
that the X_{nu} are uniformly bounded. Show that if S_n = X_1 + ··· + X_n, then
S_n/√n has in the limit the centered normal distribution with covariances

    E[X_{1u}X_{1v}] + Σ_{j=1}^∞ E[X_{1u}X_{1+j,v}] + Σ_{j=1}^∞ E[X_{1+j,u}X_{1v}].

Hint: Use the Cramér-Wold device.


29.11. ↑ As in Example 27.6, let {Y_n} be a Markov chain with finite state space
S = {1,...,s}, say. Suppose the transition probabilities p_{uv} are all positive and
the initial probabilities p_u are the stationary ones. Let f_{nu} be the number of i
for which 1 ≤ i ≤ n and Y_i = u. Show that the normalized frequency count

    n^{−1/2}(f_{n1} − np_1,...,f_{ns} − np_s)

has in the limit the centered normal distribution with covariances

    δ_{uv}p_u − p_u p_v + Σ_{j=1}^∞ (p_u p_{uv}^{(j)} − p_u p_v) + Σ_{j=1}^∞ (p_v p_{vu}^{(j)} − p_u p_v).


29.12. Assume that

    Σ = [ σ_11  σ_12 ]
        [ σ_12  σ_22 ]

is positive definite, invert it explicitly, and show that the corresponding
two-dimensional normal density is

(29.10)    f(x_1, x_2) = (1/(2πD^{1/2})) exp[ −(1/(2D))(σ_22 x_1² − 2σ_12 x_1 x_2 + σ_11 x_2²) ],

where D = σ_11 σ_22 − σ_12².


29.13. Suppose that Z has the standard normal distribution in R^1. Let μ be the
mixture with equal weights of the distributions of (Z, Z) and (Z, −Z), and let
(X, Y) have distribution μ. Prove:

(a) Although each of X and Y is normal, they are not jointly normal.
(b) Although X and Y are uncorrelated, they are not independent.




SECTION 30. THE METHOD OF MOMENTS* 


The Moment Problem 


For some distributions the characteristic function is intractable but moments 
can nonetheless be calculated. In these cases it is sometimes possible to 
prove weak convergence of the distributions by establishing that the moments 
converge. This approach requires conditions under which a distribution is
uniquely determined by its moments, and this is for the same reason that the
continuity theorem for characteristic functions requires for its proof the
uniqueness theorem.


Theorem 30.1. Let μ be a probability measure on the line having finite
moments α_k = ∫_{−∞}^{∞} x^k μ(dx) of all orders. If the power series Σ_k α_k r^k/k! has a
positive radius of convergence, then μ is the only probability measure with the
moments α_1, α_2, ....


PROOF. Let β_k = ∫_{−∞}^{∞} |x|^k μ(dx) be the absolute moments. The first step is
to show that


(30.1)    β_k r^k/k! → 0,    k → ∞,


for some positive r. By hypothesis there exists an s, 0 < s < 1, such
that α_k s^k/k! → 0. Choose 0 < r < s; then 2kr^{2k−1} < s^{2k} for large k. Since
|x|^{2k−1} ≤ 1 + |x|^{2k},


    β_{2k−1} r^{2k−1}/(2k − 1)! ≤ r^{2k−1}/(2k − 1)! + β_{2k} s^{2k}/(2k)!


for large k. Hence (30.1) holds as k goes to infinity through odd values; since
β_k = α_k for k even, (30.1) follows.
By (26.4),


    |e^{itx}(e^{ihx} − Σ_{k=0}^n (ihx)^k/k!)| ≤ |hx|^{n+1}/(n + 1)!,


and therefore the characteristic function φ of μ satisfies


    |φ(t + h) − Σ_{k=0}^n (h^k/k!) ∫_{−∞}^{∞} (ix)^k e^{itx} μ(dx)| ≤ |h|^{n+1} β_{n+1}/(n + 1)!.

* This section may be omitted. 


By (26.10), the integral here is φ^{(k)}(t). By (30.1),


(30.2)    φ(t + h) = Σ_{k=0}^∞ (φ^{(k)}(t)/k!) h^k,    |h| ≤ r.


If ν is another probability measure with moments α_k and characteristic
function ψ(t), the same argument gives


(30.3)    ψ(t + h) = Σ_{k=0}^∞ (ψ^{(k)}(t)/k!) h^k,    |h| ≤ r.


Take t = 0; since φ^{(k)}(0) = i^k α_k = ψ^{(k)}(0) (see (26.9)), φ and ψ agree in
(−r, r) and hence have identical derivatives there. Taking t = r − ε and
t = −r + ε in (30.2) and (30.3) shows that φ and ψ also agree in (−2r + ε, 2r
− ε) and hence in (−2r, 2r). But then they must by the same argument agree
in (−3r, 3r) as well, and so on.† Thus φ and ψ coincide, and by the
uniqueness theorem for characteristic functions, so do μ and ν. ∎


A probability measure satisfying the conclusion of the theorem is said to 
be determined by its moments. 


Example 30.1. For the standard normal distribution, |α_k| ≤ k!, and so the
theorem implies that it is determined by its moments. ∎
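
The bound in Example 30.1 can be checked directly. The sketch below relies on a standard fact that is not stated in the text: the moments of the standard normal distribution vanish for odd k and equal 1 × 3 × ··· × (k − 1) for even k. The function name and the choice r = 1 are illustrative. It verifies |α_k| ≤ k! over a range of k and sums the power series Σ_k α_k r^k/k! at r = 1, whose value under that formula is e^{1/2}.

```python
from math import exp, factorial

def normal_moment(k):
    # k-th moment of the standard normal: 0 for odd k, (k-1)!! for even k
    if k % 2 == 1:
        return 0
    m = 1
    for j in range(1, k, 2):
        m *= j
    return m

# the bound |alpha_k| <= k! used in Example 30.1
bound_holds = all(normal_moment(k) <= factorial(k) for k in range(60))

# partial sum of the series sum_k alpha_k r^k / k! at r = 1
r = 1.0
series = sum(normal_moment(k) * r**k / factorial(k) for k in range(60))

print(bound_holds, series, exp(0.5))
```

The series in fact converges for every r, so the radius of convergence required by Theorem 30.1 is positive with room to spare.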


But not all measures are determined by their moments: 


Example 30.2. If N has the standard normal density, then e^N has the
log-normal density


    f(x) = (1/√(2π)) (1/x) e^{−(log x)²/2}    if x > 0,
           0                                  if x ≤ 0.


Put g(x) = f(x)(1 + sin(2π log x)). If

    ∫_0^∞ x^k f(x) sin(2π log x) dx = 0,    k = 0, 1, 2, ...,

then g, which is nonnegative, will be a probability density and will have the
same moments as f. But a change of variable log x = s + k reduces the
integral above to


    (1/√(2π)) e^{k²/2} ∫_{−∞}^{∞} e^{−s²/2} sin 2πs ds,


which vanishes because the integrand is odd. ∎


†This process is a version of analytic continuation.
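
The cancellation in Example 30.2 can also be observed numerically. The sketch below is an illustration only: it substitutes x = e^y, so that the k-th moments of f and g become integrals against the standard normal density, and approximates them by a Riemann sum on a grid whose range and step size are arbitrary choices. For integer k the two moments agree, and both match the closed form e^{k²/2} obtained from the same substitution.

```python
import numpy as np

h = 1e-3                                    # grid step (illustrative)
y = np.arange(-15.0, 20.0, h)               # y = log x
normal_density = np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

def moment_f(k):
    # integral of x^k f(x) dx after the substitution x = e^y
    return np.sum(np.exp(k * y) * normal_density) * h

def moment_g(k):
    # same integral for g(x) = f(x) (1 + sin(2 pi log x))
    return np.sum(np.exp(k * y) * normal_density
                  * (1 + np.sin(2 * np.pi * y))) * h

for k in range(5):
    print(k, moment_f(k), moment_g(k), np.exp(k * k / 2))
```

The densities f and g themselves differ wherever sin(2π log x) is nonzero, so the agreement of all integer moments does not force the distributions to agree.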


Theorem 30.2. Suppose that the distribution of X is determined by its
moments, that the X_n have moments of all orders, and that lim_n E[X_n^r] =
E[X^r] for r = 1, 2, .... Then X_n ⇒ X.


PROOF. Let μ_n and μ be the distributions of X_n and X. Since E[X_n²]
converges, it is bounded, say by K. By Markov's inequality, P[|X_n| ≥ x] ≤
K/x², which implies that the sequence {μ_n} is tight.

Suppose that μ_{n_i} ⇒ ν, and let Y be a random variable with distribution ν.
If u is an even integer exceeding r, the convergence and hence boundedness
of E[X_n^u] implies that E[X_{n_i}^r] → E[Y^r], by the corollary to Theorem 25.12.
By the hypothesis, then, E[Y^r] = E[X^r]; that is, ν and μ have the same
moments. Since μ is by hypothesis determined by its moments, ν must be the
same as μ, and so μ_{n_i} ⇒ μ. The conclusion now follows by the corollary to
Theorem 25.10. ∎


Convergence to the log-normal distribution cannot be proved by establishing
convergence of moments (take X to have density f and the X_n to have
density g in Example 30.2). Because of Example 30.1, however, this approach
will work for a normal limit.


Moment Generating Functions 


Suppose that μ has a moment generating function M(s) for s ∈ [−s_0, s_0],
s_0 > 0. By (21.22), the hypothesis of Theorem 30.1 is satisfied, and so μ is
determined by its moments, which are in turn determined by M(s) via
(21.23). Thus μ is determined by M(s) if it exists in a neighborhood of 0.† The
version of this for one-sided transforms was proved in Section 22; see
Theorem 22.2.

Suppose that μ_n and μ have moment generating functions in a common
interval [−s_0, s_0], s_0 > 0, and suppose that M_n(s) → M(s) in this interval.
Since μ_n[(−a, a)^c] ≤ e^{−s_0 a}(M_n(−s_0) + M_n(s_0)), it follows easily that {μ_n} is
tight. Since M(s) determines μ, the usual argument now gives μ_n ⇒ μ.


†For another proof, see Problem 26.7. The present proof does not require the idea of analyticity.




Central Limit Theorem by Moments 


To understand the application of the method of moments, consider once
again a sum S_n = X_{n1} + ··· + X_{nk_n}, where X_{n1},...,X_{nk_n} are independent and


(30.4)    E[X_{nk}] = 0,    E[X_{nk}²] = σ_{nk}²,    s_n² = Σ_{k=1}^{k_n} σ_{nk}².


Suppose further that for each n there is an M_n such that |X_{nk}| ≤ M_n,
k = 1,...,k_n, with probability 1. Finally, suppose that


(30.5)    M_n/s_n → 0.


By the multinomial formula, for each positive integer r,


(30.6)    S_n^r = Σ_{u=1}^r Σ′ (r!/(r_1! ··· r_u!)) (1/u!) Σ″ X_{ni_1}^{r_1} ··· X_{ni_u}^{r_u},


where Σ′ extends over the u-tuples (r_1,...,r_u) of positive integers satisfying
r_1 + ··· + r_u = r and Σ″ extends over the u-tuples (i_1,...,i_u) of distinct
integers in the range 1 ≤ i_α ≤ k_n.†

By independence, then,


(30.7)    E[(S_n/s_n)^r] = Σ_{u=1}^r Σ′ (r!/(r_1! ··· r_u!)) (1/u!) A_n(r_1,...,r_u),

where

(30.8)    A_n(r_1,...,r_u) = Σ″ s_n^{−r} E[X_{ni_1}^{r_1}] ··· E[X_{ni_u}^{r_u}],


and Σ′ and Σ″ have the same ranges as before. To prove that (30.7) converges
to the rth moment of the standard normal distribution, it suffices to show
that


(30.9)    lim_n A_n(r_1,...,r_u) = 1    if r_1 = ··· = r_u = 2,
                                   0    otherwise.


Indeed, if r is even, all terms in (30.7) will then go to 0 except the one for
which u = r/2 and r_α ≡ 2, which will go to r!/(r_1! ··· r_u! u!) = 1 × 3 × 5 ×
··· × (r − 1). And if r is odd, the terms will go to 0 without exception.


†To deduce this from the multinomial formula, restrict the inner sum to u-tuples satisfying
1 ≤ i_1 < ··· < i_u ≤ k_n and compensate by striking out the 1/u!.



If r_α = 1 for some α, then (30.9) holds because by (30.4) each summand in
(30.8) vanishes. Suppose that r_α ≥ 2 for each α and r_α > 2 for some α. Then
r > 2u, and since |E[X_{ni}^{r_α}]| ≤ M_n^{r_α−2} σ_{ni}², it follows that |A_n(r_1,...,r_u)| ≤
(M_n/s_n)^{r−2u} A_n(2,...,2). But this goes to 0 because (30.5) holds and because
A_n(2,...,2) is bounded by 1 (it increases to 1 if the sum in (30.8) is enlarged
to include all the u-tuples (i_1,...,i_u)).

It remains only to check (30.9) for r_1 = ··· = r_u = 2. As just noted,
A_n(2,...,2) is at most 1, and it differs from 1 by Σ s_n^{−2u} σ_{ni_1}² ··· σ_{ni_u}², the sum
extending over the (i_1,...,i_u) with at least one repeated index. Since σ_{ni}² ≤ M_n², the
terms for example with i_u = i_{u−1} sum to at most M_n² s_n^{−2u} Σ σ_{ni_1}² ··· σ_{ni_{u−1}}² ≤
M_n² s_n^{−2}. Thus 1 − A_n(2,...,2) ≤ u² M_n² s_n^{−2} → 0.

This proves that the moments (30.7) converge to those of the normal
distribution and hence that S_n/s_n ⇒ N.


Application to Sampling Theory 
Suppose that n numbers

    x_{n1}, x_{n2}, ..., x_{nn},

not necessarily distinct, are associated with the elements of a population of
size n. Suppose that these numbers are normalized by the requirement


(30.10)    Σ_{h=1}^n x_{nh} = 0,    Σ_{h=1}^n x_{nh}² = 1,    M_n = max_{h≤n} |x_{nh}|.

An ordered sample X_{n1},...,X_{nk_n} is taken, where the sampling is without
replacement. By (30.10), E[X_{nk}] = 0 and E[X_{nk}²] = 1/n. Let s_n² = k_n/n be
the fraction of the population sampled. If the X_{nk} were independent, which
they are not, S_n = X_{n1} + ··· + X_{nk_n} would have variance s_n². If k_n is small in
comparison with n, the effects of dependence should be small. It will be
shown that S_n/s_n ⇒ N if


(30.11)    s_n² = k_n/n → 0,    M_n/s_n → 0,    k_n → ∞.


Since M_n² ≥ n^{−1} by (30.10), the second condition here in fact implies the
third.

The moments again have the form (30.7), but this time E[X_{ni_1}^{r_1} ··· X_{ni_u}^{r_u}]
cannot be factored as in (30.8). On the other hand, this expected value is by
symmetry the same for each of the (k_n)_u = k_n(k_n − 1) ··· (k_n − u + 1) choices
of the indices i_α in the sum Σ″. Thus


    A_n(r_1,...,r_u) = ((k_n)_u/s_n^r) E[X_{n1}^{r_1} ··· X_{nu}^{r_u}].


The problem again is to prove (30.9).


SECTION 30. THE METHOD OF MOMENTS 393 


The proof goes by induction on u. Now A,(r) =k, Sa'n “IEn aithn 80 that 
A,(1)=0 and A,(2)=1. If rao then xrl < M1 24 x?,, and so |A,(r)| < 
(M,,/s,)"~ 7>0 by (30.11). 

Next suppose as induction hypothesis that (30.9) ve with u — 1 in place 
of u. Since the sampling is without replacement, E(X . Xm] = AX h, ii 

x}y, /(n),, Where the summation extends over the if tuples CAjsissy Vg) 0 
distinct integers in the range 1 <h, <n. In this last sum enlarge the range by 
requiring of (h,,h),...,h,,) only that h,,...,h, be distinct, and then com- 
pensate by subtracting away the terms where h, =h,, where h, =h,, and so 
on. The result is 


r r nn Lh r r 
EL XA) Mei) = hha Xt E Mal 
Le tents E[X; eo OTe os A 
a=2 


This takes the place of the factorization made possible in (30.8) by the 
assumed independence there. It gives 


n jea = TY) al 
Atiga) = nur k | ODAN 


P E E 
C. L An Uhre onL PPA 


By the induction hypothesis the last sum is bounded, and the factor in
front goes to 0 by (30.11). As for the first term on the right, the factor in front
goes to 1. If $r_1 \ne 2$, then $A_n(r_1) \to 0$ and $A_n(r_2, \ldots, r_u)$ is bounded, and so
$A_n(r_1, \ldots, r_u) \to 0$. The same holds by symmetry if $r_\alpha \ne 2$ for some $\alpha$ other
than 1. If $r_1 = \cdots = r_u = 2$, then $A_n(r_1) = 1$, and $A_n(r_2, \ldots, r_u) \to 1$ by the
induction hypothesis.

Thus (30.9) holds in all cases, and $S_n/s_n \Rightarrow N$ follows by the method of
moments.
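The identity that stands in for the factorization (30.8) is an exact statement about sums over distinct indices, so it can be verified by brute force on a small population. The following sketch does this with exact rational arithmetic; the five labels and the function names `joint_moment` and `recursion_rhs` are hypothetical choices made only for illustration, and no normalization of the labels is needed for the identity itself.

```python
from fractions import Fraction
from itertools import permutations

def joint_moment(labels, powers):
    """E[X_1^{r_1} ... X_u^{r_u}] for u draws without replacement from labels."""
    total = Fraction(0)
    count = 0
    for draw in permutations(labels, len(powers)):
        term = Fraction(1)
        for x, r in zip(draw, powers):
            term *= x ** r
        total += term
        count += 1
    return total / count

def recursion_rhs(labels, powers):
    """Right side of the identity obtained by freeing h_1 and then compensating."""
    n, u = len(labels), len(powers)
    r1, rest = powers[0], list(powers[1:])
    shrink = Fraction(1, n - u + 1)            # (n)_{u-1} / (n)_u
    main = n * shrink * joint_moment(labels, [r1]) * joint_moment(labels, rest)
    correction = Fraction(0)
    for a in range(len(rest)):
        merged = rest.copy()
        merged[a] += r1                        # terms where h_1 coincides with h_alpha
        correction += shrink * joint_moment(labels, merged)
    return main - correction

labels = [Fraction(v) for v in (-3, -1, 0, 1, 3)]   # hypothetical population, n = 5
powers = [2, 1, 3]
print(joint_moment(labels, powers) == recursion_rhs(labels, powers))
```

Multiplying both sides by $(k_n)_u / s_n^r$ turns this identity into the recursion for $A_n(r_1, \ldots, r_u)$ used in the induction.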


Application to Number Theory 


Let $g(m)$ be the number of distinct prime factors of the integer $m$; for
example $g(3^4 \times 5^2) = 2$. Since there are infinitely many primes, $g(m)$ is
unbounded above; for the same reason, it drops back to 1 for infinitely many
$m$ (for the primes and their powers). Since $g$ fluctuates in an irregular way, it
is natural to inquire into its average behavior.
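For concreteness, $g$ can be computed by trial division. This is a minimal sketch; the function name `g` simply mirrors the notation of the text, and the algorithm is not taken from the source.

```python
def g(m):
    """Number of distinct prime factors of the integer m >= 1 (trial division)."""
    count = 0
    p = 2
    while p * p <= m:
        if m % p == 0:
            count += 1
            while m % p == 0:     # strip every power of p so it is counted once
                m //= p
        p += 1
    if m > 1:                     # whatever remains is a single prime factor
        count += 1
    return count

# Two distinct primes (3 and 5), however large the exponents.
print(g(3**4 * 5**2))
# The irregular fluctuation: values of g on a short run of integers.
print([g(m) for m in range(2, 32)])
```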

On the space $\Omega$ of positive integers, let $P_n$ be the probability measure that
places mass $1/n$ at each of $1, 2, \ldots, n$, so that among the first $n$ positive
integers the proportion that are contained in a given set $A$ is just $P_n(A)$. The
problem is to study $P_n[m\colon g(m) \le x]$ for large $n$.
If $\delta_p(m)$ is 1 or 0 according as the prime $p$ divides $m$ or not, then

$$ g(m) = \sum_p \delta_p(m). \tag{30.12} $$


Probability theory can be used to investigate this sum because under $P_n$ the
$\delta_p(m)$ behave somewhat like independent random variables. If $p_1, \ldots, p_u$ are
distinct primes, then by the fundamental theorem of arithmetic, $\delta_{p_1}(m) =
\cdots = \delta_{p_u}(m) = 1$ (that is, each $p_i$ divides $m$) if and only if the product
$p_1 \cdots p_u$ divides $m$. The probability under $P_n$ of this is just $n^{-1}$ times the
number of $m$ in the range $1 \le m \le n$ that are multiples of $p_1 \cdots p_u$, and this
number is the integer part of $n/(p_1 \cdots p_u)$. Thus

$$ P_n[m\colon \delta_{p_i}(m) = 1,\ i = 1, \ldots, u] = \frac{1}{n} \left\lfloor \frac{n}{p_1 \cdots p_u} \right\rfloor \tag{30.13} $$

for distinct $p_i$.
Now let $X_p$ be independent random variables (on some probability space,
one variable for each prime $p$) satisfying

$$ P[X_p = 1] = \frac{1}{p}, \qquad P[X_p = 0] = 1 - \frac{1}{p}. $$

If $p_1, \ldots, p_u$ are distinct, then

$$ P[X_{p_i} = 1,\ i = 1, \ldots, u] = \frac{1}{p_1 \cdots p_u}. \tag{30.14} $$


For fixed $p_1, \ldots, p_u$, (30.13) converges to (30.14) as $n \to \infty$. Thus the behavior
of the $X_p$ can serve as a guide to that of the $\delta_p(m)$. If $m \le n$, (30.12) is
$\sum_{p \le n} \delta_p(m)$, because no prime exceeding $m$ can divide it. The idea† is to
compare this sum with the corresponding sum $\sum_{p \le n} X_p$.
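The closeness of (30.13) to (30.14) is easy to see numerically. The sketch below counts multiples directly, compares the count with the integer-part formula, and measures the gap to the independent model; the function names are illustrative and not part of the text.

```python
from fractions import Fraction
from math import prod

def prob_all_divide(n, primes):
    """P_n[m: every listed prime divides m], by direct counting over 1..n."""
    hits = sum(1 for m in range(1, n + 1) if all(m % p == 0 for p in primes))
    return Fraction(hits, n)

def floor_formula(n, primes):
    """Right side of (30.13): (1/n) * floor(n / (p_1 ... p_u))."""
    return Fraction(n // prod(primes), n)

def independent_model(primes):
    """Right side of (30.14): 1 / (p_1 ... p_u)."""
    return Fraction(1, prod(primes))

for n in (10, 100, 1000, 10000):
    gap = independent_model([2, 3]) - prob_all_divide(n, [2, 3])
    print(n, prob_all_divide(n, [2, 3]), float(gap))
```

Because dropping the fractional part of $n/(p_1 \cdots p_u)$ changes it by less than 1, the gap always lies in $[0, 1/n)$, which is the quantitative content of the convergence.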

This will require from number theory the elementary estimate*

$$ \sum_{p \le x} \frac{1}{p} = \log\log x + O(1). \tag{30.15} $$

The mean and variance of $\sum_{p \le n} X_p$ are $\sum_{p \le n} p^{-1}$ and $\sum_{p \le n} p^{-1}(1 - p^{-1})$;
since $\sum_p p^{-2}$ converges, each of these two sums is asymptotically $\log\log n$.
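The estimate (30.15) can be watched at work with a sieve. In this sketch the names `primes_upto` and `mertens_gap` are illustrative; the observation that the $O(1)$ term settles near 0.26 (Mertens' constant) is a known fact from number theory rather than something proved in the text.

```python
from math import isqrt, log

def primes_upto(x):
    """Sieve of Eratosthenes: list of all primes p <= x."""
    if x < 2:
        return []
    is_prime = [True] * (x + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, isqrt(x) + 1):
        if is_prime[p]:
            for multiple in range(p * p, x + 1, p):
                is_prime[multiple] = False
    return [p for p in range(2, x + 1) if is_prime[p]]

def mertens_gap(x):
    """Sum of 1/p over primes p <= x, minus log log x: the O(1) term in (30.15)."""
    return sum(1.0 / p for p in primes_upto(x)) - log(log(x))

def mean_and_variance(n):
    """Mean and variance of the sum of the independent X_p over primes p <= n."""
    ps = primes_upto(n)
    return sum(1.0 / p for p in ps), sum((1.0 / p) * (1.0 - 1.0 / p) for p in ps)

for x in (10**2, 10**4, 10**5):
    print(x, mertens_gap(x), mean_and_variance(x))
```

The mean and variance differ by a partial sum of $\sum_p p^{-2}$, which stays bounded, so both grow like $\log\log n$.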


†Compare Problems 2.18, 5.19, and 6.16.
*See, for example, Problem 18.17, or HARDY & WRIGHT, Chapter XXII.




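Comparing $\sum_{p \le n} \delta_p(m)$ with $\sum_{p \le n} X_p$ then leads one to conjecture the
Erdős–Kac central limit theorem for the prime divisor function:

Theorem 30.3. For all $x$,

$$ P_n\left[m\colon \frac{g(m) - \log\log n}{\sqrt{\log\log n}} \le x\right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \tag{30.16} $$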


PROOF. The argument uses the method of moments. The first step is to
show that (30.16) is unaffected if the range of $p$ in (30.12) is further
restricted. Let $\{a_n\}$ be a sequence going to infinity slowly enough that

$$ \frac{\log a_n}{\log n} \to 0 \tag{30.17} $$

but fast enough that

$$ \sum_{a_n < p \le n} \frac{1}{p} = o(\log\log n)^{1/2}. \tag{30.18} $$

Because of (30.15), these two requirements are met if, for example, $\log a_n =
(\log n)/\log\log n$.
Now define

$$ g_n(m) = \sum_{p \le a_n} \delta_p(m). \tag{30.19} $$


For a function $f$ of positive integers, let

$$ E_n[f] = n^{-1} \sum_{m=1}^{n} f(m) $$

denote its expected value computed with respect to $P_n$. By (30.13) for $u = 1$,

$$ E_n\left[\sum_{p > a_n} \delta_p\right] = \sum_{a_n < p \le n} P_n[m\colon \delta_p(m) = 1] \le \sum_{a_n < p \le n} \frac{1}{p}. $$

By (30.18) and Markov's inequality,

$$ P_n[m\colon |g(m) - g_n(m)| \ge \epsilon (\log\log n)^{1/2}] \to 0. $$

Therefore (Theorem 25.4), (30.16) is unaffected if $g_n(m)$ is substituted for
$g(m)$.
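The truncation step rests on two inequalities: the expected number of prime divisors exceeding $a_n$ is at most $\sum_{a_n < p \le n} 1/p$, and Markov's inequality converts that bound into a bound on a probability. Both can be confirmed exactly for small $n$. In this sketch the cutoff `a` plays the role of $a_n$, and all function names are hypothetical.

```python
from fractions import Fraction
from math import isqrt

def primes_upto(x):
    """All primes p <= x, by trial division (adequate for small x)."""
    return [p for p in range(2, x + 1)
            if all(p % q != 0 for q in range(2, isqrt(p) + 1))]

def tail_counts(n, a):
    """g(m) - g_n(m) for m = 1..n: the number of primes p > a dividing m."""
    big_primes = [p for p in primes_upto(n) if p > a]
    return [sum(1 for p in big_primes if m % p == 0) for m in range(1, n + 1)]

def tail_mean(n, a):
    """E_n[g - g_n], computed exactly."""
    return Fraction(sum(tail_counts(n, a)), n)

def tail_bound(n, a):
    """Sum of 1/p over primes a < p <= n, the bound obtained from (30.13)."""
    return sum((Fraction(1, p) for p in primes_upto(n) if p > a), Fraction(0))

def markov_pair(n, a, t):
    """(P_n[g - g_n >= t], E_n[g - g_n] / t); Markov: first <= second."""
    counts = tail_counts(n, a)
    prob = Fraction(sum(1 for c in counts if c >= t), n)
    return prob, tail_mean(n, a) / t

print(float(tail_mean(300, 5)), float(tail_bound(300, 5)))
print(markov_pair(300, 5, 2))
```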




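Now compare (30.19) with the corresponding sum $S_n = \sum_{p \le a_n} X_p$. The
mean and variance of $S_n$ are

$$ c_n = \sum_{p \le a_n} \frac{1}{p}, \qquad s_n^2 = \sum_{p \le a_n} \frac{1}{p}\left(1 - \frac{1}{p}\right), $$

and each is $\log\log n + o(\log\log n)^{1/2}$ by (30.18). Thus (see Example 25.8),
(30.16) with $g(m)$ replaced as above is equivalent to

$$ P_n\left[m\colon \frac{g_n(m) - c_n}{s_n} \le x\right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \tag{30.20} $$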


It therefore suffices to prove (30.20). 

Since the $X_p$ are bounded, the analysis of the moments (30.7) applies
here. The only difference is that the summands in $S_n$ are indexed not by the
integers $k$ in the range $k \le k_n$ but by the primes $p$ in the range $p \le a_n$; also,
$X_p$ must be replaced by $X_p - p^{-1}$ to center it. Thus the $r$th moment of
$(S_n - c_n)/s_n$ converges to that of the normal distribution, and so (30.20) and
(30.16) will follow by the method of moments if it is shown that as $n \to \infty$,

$$ E\left[\left(\frac{S_n - c_n}{s_n}\right)^r\right] - E_n\left[\left(\frac{g_n - c_n}{s_n}\right)^r\right] \to 0 \tag{30.21} $$

for each $r$.

Now $E[S_n^r]$ is the sum

$$ \sum_{u=1}^{r} {\sum}' \frac{r!}{r_1! \cdots r_u!} \frac{1}{u!} {\sum}'' E[X_{p_1}^{r_1} \cdots X_{p_u}^{r_u}], \tag{30.22} $$


where the range of $\sum'$ is as in (30.6) and (30.7), and $\sum''$ extends over the
$u$-tuples $(p_1, \ldots, p_u)$ of distinct primes not exceeding $a_n$. Since $X_p$ assumes
only the values 0 and 1, from the independence of the $X_p$ and the fact that
the $p_i$ are distinct, it follows that the summand in (30.22) is

$$ E[X_{p_1} \cdots X_{p_u}] = \frac{1}{p_1 \cdots p_u}. \tag{30.23} $$


By the definition (30.19), $E_n[g_n^r]$ is just (30.22) with the summand replaced
by $E_n[\delta_{p_1}^{r_1} \cdots \delta_{p_u}^{r_u}]$. Since $\delta_p(m)$ assumes only the values 0 and 1, from (30.13)
and the fact that the $p_i$ are distinct, it follows that this summand is

$$ E_n[\delta_{p_1} \cdots \delta_{p_u}] = \frac{1}{n} \left\lfloor \frac{n}{p_1 \cdots p_u} \right\rfloor. \tag{30.24} $$




But (30.23) and (30.24) differ by at most $1/n$, and hence $E[S_n^r]$ and $E_n[g_n^r]$
differ by at most the sum (30.22) with the summand replaced by $1/n$.
Therefore,

$$ |E[S_n^r] - E_n[g_n^r]| \le \frac{1}{n} \left(\sum_{p \le a_n} 1\right)^r \le \frac{a_n^r}{n}. \tag{30.25} $$
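The bound (30.25) can be tested exactly when only a few primes are involved. The sketch below computes $E[S_n^r]$ by enumerating the outcomes of the independent $X_p$ and $E_n[g_n^r]$ by direct counting, then compares their difference with $(\sum_{p \le a_n} 1)^r / n$. The prime list stands in for the primes not exceeding $a_n$, and the function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def moment_independent(primes, r):
    """E[S^r] for S = sum of independent X_p with P[X_p = 1] = 1/p."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=len(primes)):
        prob = Fraction(1)
        for b, p in zip(bits, primes):
            prob *= Fraction(1, p) if b else Fraction(p - 1, p)
        total += prob * sum(bits) ** r
    return total

def moment_under_Pn(n, primes, r):
    """E_n[g_n^r], where g_n(m) counts the listed primes that divide m."""
    total = sum(sum(1 for p in primes if m % p == 0) ** r
                for m in range(1, n + 1))
    return Fraction(total, n)

def gap_and_bound(n, primes, r):
    """(|E[S^r] - E_n[g_n^r]|, (number of primes)^r / n), as in (30.25)."""
    gap = abs(moment_independent(primes, r) - moment_under_Pn(n, primes, r))
    return gap, Fraction(len(primes) ** r, n)

primes = [2, 3, 5, 7]                     # primes not exceeding a hypothetical a_n = 10
for r in (1, 2, 3, 4):
    gap, bound = gap_and_bound(1000, primes, r)
    print(r, float(gap), float(bound))
```

When $n$ is a multiple of $2 \cdot 3 \cdot 5 \cdot 7 = 210$, every integer part in (30.24) is exact, the $\delta_p$ are exactly independent under $P_n$, and the gap vanishes.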


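Now

$$ E[(S_n - c_n)^r] = \sum_{k=0}^{r} \binom{r}{k} E[S_n^k] (-c_n)^{r-k}, $$

and $E_n[(g_n - c_n)^r]$ has the analogous expansion. Comparing the two expansions
term for term and applying (30.25) shows that

$$ |E[(S_n - c_n)^r] - E_n[(g_n - c_n)^r]| \le \sum_{k=0}^{r} \binom{r}{k} \frac{a_n^k}{n} c_n^{r-k} = \frac{1}{n} (a_n + c_n)^r. \tag{30.26} $$

Since $c_n \le a_n$, and since $a_n^r/n \to 0$ by (30.17), (30.21) follows as required. ∎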


The method of proof requires passing from (30.12) to (30.19). Without
this, the $a_n$ on the right in (30.26) would instead be $n$, and it would not
follow that the difference on the left goes to 0; hence the truncation (30.19)
for an $a_n$ small enough to satisfy (30.17). On the other hand, $a_n$ must be
large enough to satisfy (30.18), in order that the truncation leave (30.16)
unaffected.
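An empirical look at (30.16) is possible with a sieve, with one caveat: the normalization grows only like $\log\log n$, so at any computationally feasible $n$ the agreement with the normal distribution is rough. The sketch below is illustrative only, and its function names are not from the text; it tabulates $g(m)$ for $m \le n$ and reports the fraction of $m$ whose standardized value is at most $x$.

```python
from math import log, sqrt

def distinct_prime_factor_counts(n):
    """Return a list g with g[m] = number of distinct primes dividing m, 0 <= m <= n."""
    g = [0] * (n + 1)
    for p in range(2, n + 1):
        if g[p] == 0:                      # no smaller prime divides p, so p is prime
            for multiple in range(p, n + 1, p):
                g[multiple] += 1
    return g

def standardized_fraction(n, x):
    """P_n[m: (g(m) - log log n) / sqrt(log log n) <= x], computed by counting."""
    g = distinct_prime_factor_counts(n)
    center = log(log(n))
    scale = sqrt(center)
    hits = sum(1 for m in range(1, n + 1) if (g[m] - center) / scale <= x)
    return hits / n

n = 10**4
g = distinct_prime_factor_counts(n)
print(sum(g[1:]) / n, log(log(n)))         # empirical mean of g against log log n
print([standardized_fraction(n, x) for x in (-1.0, 0.0, 1.0)])
```

Because $g$ takes only integer values and $\sqrt{\log\log n}$ is barely above 1 in this range, the empirical distribution function is a coarse step function; the theorem describes its limit, not its shape at moderate $n$.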


PROBLEMS 


30.1. From the central limit theorem under the assumption (30.5) get the full 
Lindeberg theorem by a truncation argument. 


30.2. For a sample of size $k_n$ with replacement from a population of size $n$, the
probability of no duplicates is $\prod_{j=0}^{k_n - 1} (1 - j/n)$. Under the assumption
$k_n/\sqrt{n} \to 0$ in addition to (30.10), deduce the asymptotic normality of $S_n$ by a
reduction to the independent case.


30.3. By adapting the proof of (21.24), show that the moment generating function of
$\mu$ in an arbitrary interval determines $\mu$.

30.4. 25.13 30.3↑ Suppose that the moment generating function of $\mu_n$ converges
to that of $\mu$ in some interval. Show that $\mu_n \Rightarrow \mu$.


30.5. Let $\mu$ be a probability measure on $R^k$ for which $\int_{R^k} |x_i|^r \mu(dx) < \infty$ for $i =
1, \ldots, k$ and $r = 1, 2, \ldots$. Consider the cross moments

$$ \alpha(r_1, \ldots, r_k) = \int_{R^k} x_1^{r_1} \cdots x_k^{r_k}\, \mu(dx) $$

for nonnegative integers $r_i$.
(a) Suppose for each $i$ that

$$ \sum_r \frac{\theta^r}{r!} \int_{R^k} |x_i|^r \mu(dx) \tag{30.27} $$

has a positive radius of convergence as a power series in $\theta$. Show that $\mu$ is
determined by its moments in the sense that, if a probability measure $\nu$
satisfies $\alpha(r_1, \ldots, r_k) = \int x_1^{r_1} \cdots x_k^{r_k}\, \nu(dx)$ for all $r_1, \ldots, r_k$, then $\nu$ coincides
with $\mu$.
(b) Show that a $k$-dimensional normal distribution is determined by its moments.


30.6. ↑ Let $\mu_n$ and $\mu$ be probability measures on $R^k$. Suppose that for each $i$,
(30.27) has a positive radius of convergence. Suppose that

$$ \int x_1^{r_1} \cdots x_k^{r_k}\, \mu_n(dx) \to \int x_1^{r_1} \cdots x_k^{r_k}\, \mu(dx) $$

for all nonnegative integers $r_1, \ldots, r_k$. Show that $\mu_n \Rightarrow \mu$.


30.7. 30.5↑ Suppose that $X$ and $Y$ are bounded random variables and that $X^m$
and $Y^n$ are uncorrelated for $m, n = 1, 2, \ldots$. Show that $X$ and $Y$ are independent.


30.8. 26.17 30.6↑ (a) In the notation (26.32), show for $\lambda \ne 0$ that

$$ M[(\cos \lambda x)^r] = \binom{r}{r/2} 2^{-r} \tag{30.28} $$

for even $r$ and that the mean is 0 for odd $r$. It follows by the method of
moments that $\cos \lambda x$ has a distribution in the sense of (25.18), and in fact of
course the relative measure is

$$ \rho[x\colon \cos \lambda x \le u] = 1 - \frac{1}{\pi} \arccos u, \qquad -1 \le u \le 1. \tag{30.29} $$

(b) Suppose that $\lambda_1, \lambda_2, \ldots$ are linearly independent over the field of rationals
in the sense that, if $n_1 \lambda_1 + \cdots + n_m \lambda_m = 0$ for integers $n_\nu$, then $n_1 = \cdots =
n_m = 0$. Show that

$$ M\left[\prod_{\nu=1}^{k} (\cos \lambda_\nu x)^{r_\nu}\right] = \prod_{\nu=1}^{k} M[(\cos \lambda_\nu x)^{r_\nu}] \tag{30.30} $$

for nonnegative integers $r_1, \ldots, r_k$.


(c) Let $X_1, X_2, \ldots$ be independent and have the distribution function on the
right in (30.29). Show that

$$ \rho\left[x\colon \sum_{j=1}^{k} \cos \lambda_j x \le u\right] = P[X_1 + \cdots + X_k \le u]. \tag{30.31} $$

(d) Show that

$$ \lim_{k \to \infty} \rho\left[x\colon u_1 < \sqrt{\frac{2}{k}} \sum_{j=1}^{k} \cos \lambda_j x \le u_2\right] = \frac{1}{\sqrt{2\pi}} \int_{u_1}^{u_2} e^{-v^2/2}\,dv. \tag{30.32} $$

For a signal that is the sum of a large number of pure cosine signals with
incommensurable frequencies, (30.32) describes the relative amount of time
the signal is between $u_1$ and $u_2$.


30.9. 6.16↑ From (30.16), deduce once more the Hardy–Ramanujan theorem (see
(6.10)).


30.10. ↑ (a) Prove that (if $P_n$ puts probability $1/n$ at $1, \ldots, n$)

$$ \lim_n P_n\left[m\colon \frac{|\log\log m - \log\log n|}{\sqrt{\log\log n}} \ge \epsilon\right] = 0. \tag{30.33} $$

(b) From (30.16) deduce that (see (2.35) for the notation)

$$ D\left[m\colon \frac{g(m) - \log\log m}{\sqrt{\log\log m}} \le x\right] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. \tag{30.34} $$


30.11. ↑ Let $G(m)$ be the number of prime factors in $m$ with multiplicity counted.
In the notation of Problem 5.19, $G(m) = \sum_p \alpha_p(m)$.

(a) Show for $k \ge 1$ that $P_n[m\colon \alpha_p(m) - \delta_p(m) \ge k] \le 1/p^{k+1}$; hence
$E_n[\alpha_p - \delta_p] \le 2/p^2$.

(b) Show that $E_n[G - g]$ is bounded.

(c) Deduce from (30.16) that

$$ P_n\left[m\colon \frac{G(m) - \log\log n}{\sqrt{\log\log n}} \le x\right] \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\,du. $$

(d) Prove for $G$ the analogue of (30.34).


30.12. ↑ Prove the Hardy–Ramanujan theorem in the form

$$ D\left[m\colon \left|\frac{g(m)}{\log\log m} - 1\right| \ge \epsilon\right] = 0. $$

Prove this with $G$ in place of $g$.


